Streamed processing of input documents

Saxon 8.5 introduces a new optimization, specifically designed for streamed processing of large documents, where the need to allocate sufficient memory to hold the entire source tree is traditionally a problem.

This currently works only in XSLT, and is supported only in Saxon-SA (though the stylesheet does not need to be schema-aware). It involves no new language constructs, but needs to be enabled by means of the extension attribute saxon:read-once="yes". It is invoked by a stereotypical coding pattern that the optimizer recognizes and treats specially.

The typical code is as follows:

<xsl:function name="f:customers">
    <xsl:copy-of select="doc('customers.xml')/*/customer"
                 saxon:read-once="yes"
                 xmlns:saxon="http://saxon.sf.net/"/>
</xsl:function>

<xsl:template name="main">
    <xsl:apply-templates select="f:customers()"/>
</xsl:template>

Such a stylesheet does not need a principal source document; the transformation can instead be invoked using the -it main option on the command line, or its equivalent in the Java API.

The important factors here are:

The xsl:copy-of instruction must be used, with a select expression that starts with a call on the document() or doc() function. The significance of this is that because the instruction is making a copy of the nodes in the external document, there is no requirement that the nodes returned by multiple attempts to access the same document should have the same identity. Saxon therefore doesn't need to include the source document in its in-memory document pool. (The "copy" is purely notional of course, since the original source tree corresponding to the customers.xml document is never materialized.)
The path expression introduced by the call on document() or doc() must conform to the rules for path expressions appearing in identity constraints in XML Schema. This means there must be no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally use the attribute axis; all other steps must be simple axis steps using the child axis. These restrictions allow Saxon to use the same code for serial XPath processing that is already used for validating identity constraints against a schema (which is one reason the facility is available only in Saxon-SA).
It is not absolutely essential that the xsl:copy-of instruction is used within a function, but this is the easiest way of ensuring that the instruction is executed in pull mode, which is a prerequisite for this optimization to be activated.
The optimization is enabled only if the saxon:read-once attribute is present and is set to "yes", and if the stylesheet is processed using Saxon-SA.
The optimization should not be enabled if the source document is read more than once in the course of the transformation. There are two reasons for this: firstly, performance will be better in this case if the document is read into memory; and secondly, when this optimization is used, there is no guarantee that the document() function will be stable, that is, that it will return the same results when called repeatedly with the same URI.
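
To make the path restrictions concrete, here is a minimal sketch of select expressions that appear to satisfy the rules above. The file orders.xml, the order, item, id and status names, and the f namespace URI are all hypothetical, chosen only for illustration. The final template shows one possible way to apply a condition without putting a predicate in the streamed path: filter the result of the function call instead.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="saxon f">

  <!-- Conforms: child-axis steps only, as in the example earlier on this page. -->
  <xsl:function name="f:customers">
    <xsl:copy-of select="doc('customers.xml')/*/customer"
                 saxon:read-once="yes"/>
  </xsl:function>

  <!-- Conforms: "//" introduces the first step only; later steps use the child axis.
       (orders.xml and its element names are assumed for illustration.) -->
  <xsl:function name="f:items">
    <xsl:copy-of select="doc('orders.xml')//order/item"
                 saxon:read-once="yes"/>
  </xsl:function>

  <!-- Conforms: the last step uses the attribute axis. -->
  <xsl:function name="f:customer-ids">
    <xsl:copy-of select="doc('customers.xml')/*/customer/@id"
                 saxon:read-once="yes"/>
  </xsl:function>

  <!-- Does NOT conform, because the path contains a predicate (shown as a comment only):
         doc('customers.xml')/*/customer[@status = 'active']
       The template below filters the function result instead. -->
  <xsl:template name="main">
    <xsl:apply-templates select="f:customers()[@status = 'active']"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the predicate is applied to the sequence returned by f:customers() rather than inside the path, the select expression of xsl:copy-of itself stays within the identity-constraint subset.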

The implementation of this facility uses multithreading. One thread (which operates as a push pipeline) is used to read the source document and filter out the nodes selected by the path expression. The nodes are then handed over to the main processing thread, which iterates over the selected nodes using an XPath pull pipeline. Because multithreading is used, this facility is not used when tracing is enabled. It should also be disabled when using a debugger (there is a method in the Configuration object to achieve this).

Note that a tree is still built for each selected node, containing that node and its subtree. The saving in memory comes when these nodes are processed one at a time, because each subtree can then be discarded as soon as it has been processed. There is no benefit if the stylesheet needs to perform non-serial processing, such as sorting. There is also no benefit if the path expression selects a node that contains most or all of the source document, for example its outermost element.
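
The difference between serial and non-serial use can be sketched as follows. This is an illustration only: the name child element and the f namespace URI are assumptions, not part of the original example. The main template visits each customer once, in document order, so each subtree can be released after use. The sorted template cannot produce its first result until every customer has been read, so all the subtrees have to be held at once.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="saxon f">

  <xsl:output method="text"/>

  <xsl:function name="f:customers">
    <xsl:copy-of select="doc('customers.xml')/*/customer"
                 saxon:read-once="yes"/>
  </xsl:function>

  <!-- Serial: one customer at a time, in document order. -->
  <xsl:template name="main">
    <xsl:for-each select="f:customers()">
      <!-- "name" is a hypothetical child element of customer -->
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <!-- Non-serial: sorting needs the whole sequence before any output. -->
  <xsl:template name="sorted">
    <xsl:for-each select="f:customers()">
      <xsl:sort select="name"/>
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Either template can be chosen as the entry point with the -it option, for example -it main or -it sorted.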

Saxon can handle expressions that select nested nodes, for example //section where one section contains another. However, the need to deliver nodes in document order makes the pipeline somewhat turbulent in such cases, increasing memory usage.

Serial processing in this way is not actually faster than conventional processing (in fact, it may run at only half the speed). Its big advantage is that it saves memory, thus making it possible to process documents that would otherwise be too large for XSLT to handle. There may also be environments where the multithreading enables greater use of the processor capacity available. To run without this optimization, either change the xsl:copy-of instruction to xsl:sequence, or set saxon:read-once to "no".
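
The two ways of switching the optimization off might look like the sketch below. The function names and the f namespace URI are hypothetical. Note that the two variants are not strictly equivalent in XSLT terms: xsl:sequence returns the original nodes of the document, whereas xsl:copy-of returns copies of them.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="saxon f">

  <!-- Option 1: xsl:sequence instead of xsl:copy-of.
       Returns the original customer nodes from the source document. -->
  <xsl:function name="f:customers-by-reference">
    <xsl:sequence select="doc('customers.xml')/*/customer"/>
  </xsl:function>

  <!-- Option 2: keep xsl:copy-of but set saxon:read-once to "no". -->
  <xsl:function name="f:customers-copied">
    <xsl:copy-of select="doc('customers.xml')/*/customer"
                 saxon:read-once="no"/>
  </xsl:function>

  <xsl:template name="main">
    <xsl:apply-templates select="f:customers-by-reference()"/>
  </xsl:template>

</xsl:stylesheet>
```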