Streaming of Large Documents
Saxon 8.5 introduced a new optimization called streaming copy,
specifically designed for processing of large
documents, where the need to allocate sufficient memory to hold the entire source tree is
traditionally a problem.
This currently works only in XSLT, and is supported only in Saxon-SA (though the stylesheet
does not need to be schema-aware). It involves no new language constructs, but needs to be enabled
by means of the extension attribute
saxon:read-once="yes". It is invoked by
a stereotypical coding pattern that the optimizer recognizes and treats specially.
A very simple way of using this technique is when making a selective copy of parts of a document.
For example, a stylesheet can
create an output document containing all the
footnote elements from the source document
that have a particular attribute value.
Note the restrictions below on the kind of predicate that may be used.
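A minimal sketch of such a selective copy is shown here. The file name thesis.xml, the wrapper element footnotes, and the predicate @type='endnote' are illustrative assumptions, not values taken from this page; the namespace http://saxon.sf.net/ is the usual Saxon extension namespace.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Named entry template: the large document is read via doc(),
       not supplied as the principal source document. -->
  <xsl:template name="main">
    <footnotes>
      <!-- Hypothetical file name and attribute test. The extension
           attribute saxon:read-once="yes" requests the streaming copy. -->
      <xsl:copy-of select="doc('thesis.xml')//footnote[@type='endnote']"
                   saxon:read-once="yes"/>
    </footnotes>
  </xsl:template>

</xsl:stylesheet>
```

Because the selected elements are copied straight to the result tree, each footnote subtree can in principle be discarded as soon as it has been written.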
More typically, the copied nodes will be further processed. For example:
<xsl:value-of select="code, name, location" separator="|"/>
Such a stylesheet does not need a principal source document: the transformation
can instead be invoked using the
-it main option from the command line, or its equivalent
in the Java API.
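The xsl:value-of line above appears to be one piece of a larger stylesheet. The following is a sketch of how the full pattern might look, assuming the document customers.xml mentioned later in this section, a hypothetical function f:customers in a made-up namespace, and customer elements that are children of the outermost element.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="saxon f">

  <xsl:output method="text"/>

  <!-- Hypothetical function: returns copies of the selected elements.
       The path /*/customer is an assumption about the document layout. -->
  <xsl:function name="f:customers">
    <xsl:copy-of select="doc('customers.xml')/*/customer"
                 saxon:read-once="yes"/>
  </xsl:function>

  <!-- Entry point, invoked with the -it main option -->
  <xsl:template name="main">
    <xsl:apply-templates select="f:customers()"/>
  </xsl:template>

  <!-- Each copied customer is processed on its own, one at a time -->
  <xsl:template match="customer">
    <xsl:value-of select="code, name, location" separator="|"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Wrapping the xsl:copy-of in a function keeps the selected nodes from being bound to a variable, so they can be consumed as they arrive rather than all held in memory.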
The important factors here are:
The xsl:copy-of instruction must be used, with a select expression that starts
with a call on the
doc() function. The significance of this is
that because the instruction is making a copy of the nodes in the external document, there is no requirement
that the nodes returned by multiple attempts to access the same document should have the same identity.
Saxon therefore doesn't need to include the source document in its in-memory document pool.
(The "copy" is purely notional of course, since the original source tree corresponding to the
customers.xml document is never materialized.)
The path expression introduced by the call on
doc() must conform
to a subset of XPath defined as follows:
any XPath expression is acceptable if it conforms to the rules for path expressions appearing
in identity constraints in XML Schema. These rules allow
no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally
use the attribute axis; all other steps must be simple Axis Steps using the child axis.
In addition, Saxon allows the expression to contain a union, for example
doc()/(*/ABC | */XYZ).
Unions can also be expressed in abbreviated form, for example
the above can be written as doc()/*/(ABC | XYZ).
The expression must either select elements only, or attributes only. It must not select any other kind of
node, and it must not select a mixture of elements and attributes.
Simple filters are also supported. The filter may apply to the last step or to the expression as a whole,
and it must only use downward selection from the context node (the self, child, attribute, descendant,
descendant-or-self, or namespace axes). It must not be positional (that is, it must not reference position()
or last(), and must not be numeric: in fact, it must be such that Saxon can determine at compile time that it
will not be numeric). Filters cannot be applied to unions or to branches of unions.
Any violation of these conditions causes the expression to be evaluated without the streaming optimization.
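The rules above can be illustrated with a few select expressions. This is a sketch derived only from the conditions as stated in this section; the file name big.xml and all element and attribute names are hypothetical, and whether Saxon actually streams each case has not been verified here. Every instruction remains valid XSLT either way, since a non-conforming expression is simply evaluated without the optimization.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <xsl:template name="main">
    <out>
      <!-- Expected to conform: "//" introduces the first step only -->
      <xsl:copy-of select="doc('big.xml')//footnote" saxon:read-once="yes"/>

      <!-- Expected to conform: union of two child-axis paths -->
      <xsl:copy-of select="doc('big.xml')/(*/ABC | */XYZ)" saxon:read-once="yes"/>

      <!-- Expected to conform: non-positional filter on the last step,
           using only the attribute axis -->
      <xsl:copy-of select="doc('big.xml')//footnote[@type='endnote']"
                   saxon:read-once="yes"/>

      <!-- Expected NOT to conform: "//" used after the first step -->
      <xsl:copy-of select="doc('big.xml')/book//footnote" saxon:read-once="yes"/>

      <!-- Expected NOT to conform: numeric (positional) filter -->
      <xsl:copy-of select="doc('big.xml')//footnote[1]" saxon:read-once="yes"/>

      <!-- Expected NOT to conform: filter uses the parent axis -->
      <xsl:copy-of select="doc('big.xml')//footnote[../@draft='yes']"
                   saxon:read-once="yes"/>

      <!-- Expected NOT to conform: selects text nodes, not elements -->
      <xsl:copy-of select="doc('big.xml')//para/text()" saxon:read-once="yes"/>
    </out>
  </xsl:template>

</xsl:stylesheet>
```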
It is not absolutely essential that the
xsl:copy-of instruction is used within a function.
Placing it within a variable may be counterproductive, since all the selected data will then be held
in memory. However, this may be viable if the select expression selects only a small subset of the data
in the source document. It is also possible to use the
xsl:copy-of instruction to copy the
selected elements from the source document directly to the result tree, without further processing.
The optimization is enabled only if the
saxon:read-once attribute is present and is
set to "yes", and if the stylesheet is processed using Saxon-SA.
The optimization should not be enabled if the source document is read more than once in the course
of the transformation. There are two reasons for this: firstly, performance will be better in this case if the
document is read into memory; and secondly, when this optimization is used, there is no guarantee that the
document() function will be stable, that is, that it will return the same results when called
repeatedly with the same URI.
The implementation of this facility typically uses multithreading. One thread (which operates as a push pipeline)
is used to read the source document and filter out the nodes selected by the path expression. The nodes are then
handed over to the main processing thread, which iterates over the selected nodes using an XPath pull pipeline.
Because multithreading is used, this facility is not used when tracing is enabled. It should also be disabled
when using a debugger (there is a method in the Configuration object to achieve this).
In cases where the entire stylesheet can be evaluated in "push" mode (as in the first example above),
there is no need for multithreading: the selected nodes are written directly to the current output destination.
Note that a tree is built for each selected node, and its subtree. Trees are also built for all nodes selected
by the path expression, whether or not they satisfy the filter (if they do not satisfy the filter, they will
be immediately discarded from memory). The saving in memory comes when these nodes
are processed one at a time, because each subtree can then be discarded as soon as it has been processed. There
is no benefit if the stylesheet needs to perform non-serial processing, such as sorting. There is also no benefit
if the path expression selects a node that contains most or all of the source document, for example its outermost
element.
Saxon can handle expressions that select nested nodes, for example
//section where one section
contains another. However, the need to deliver nodes in document order makes the pipeline somewhat turbulent
in such cases, increasing memory usage.
Serial processing in this way is not actually faster than conventional processing (in fact, when multithreading
is used it may run at only half the speed). Its big advantage is that it saves memory, thus making it possible to
process documents that would otherwise be too large for XSLT to handle. There may also be environments
where the multithreading enables greater use of the processor capacity available.
To run without this optimization,
either change the
xsl:copy-of instruction to another instruction such as xsl:sequence, or set
saxon:read-once to "no".