Streaming using XSLT 3.0
Saxon-EE (as at Saxon 9.6) is close to the draft XSLT 3.0 recommendation in terms of the streaming facilities it supports. Specifically, it implements nearly all the streaming features in the Working Draft of 12 December 2013, plus most of the (mainly minor) changes agreed by the Working Group up to September 2014. There are some restrictions and extensions, however.
There are two main ways to initiate a streaming transformation:
- Using the xsl:stream instruction, where the source document is identified within the stylesheet itself. Typically such a stylesheet will have a named template as its entry point, and will not have any principal source document supplied externally.
- By supplying a source document as input to a stylesheet whose initial mode is declared with streamable="yes" in an xsl:mode declaration. In this case the source document must be supplied as a StreamSource or SAXSource, and not as an in-memory tree.
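As a rough illustration of these two entry points, the sketch below declares the unnamed mode streamable (for a source supplied externally as a StreamSource or SAXSource) and also provides a named template that opens the document itself with xsl:stream. The template name main, the file name transactions.xml and the element names are assumptions made for the example. The syntax follows the XSLT 3.0 draft described on this page and may differ in other drafts.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Second approach: the initial (unnamed) mode is declared streamable,
       so a source supplied externally is processed in streaming mode. -->
  <xsl:mode streamable="yes"/>

  <xsl:template match="/">
    <!-- A single downward selection, so the rule body is consuming
         but still streamable. -->
    <count>
      <xsl:value-of select="count(transactions/transaction)"/>
    </count>
  </xsl:template>

  <!-- First approach: entered through a named template (hypothetical
       name "main") with no principal source document; xsl:stream
       identifies the input (hypothetical file name). -->
  <xsl:template name="main">
    <xsl:stream href="transactions.xml">
      <count>
        <xsl:value-of select="count(transactions/transaction)"/>
      </count>
    </xsl:stream>
  </xsl:template>

</xsl:stylesheet>
```

Either entry point is expected to produce the same count element. The difference lies only in who identifies the source document: the caller or the stylesheet.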
The saxon:stream extension function used in previous releases is still supported in Saxon 9.6 for the time being. The original Saxon mechanism for streaming, namely the saxon:read-once attribute on xsl:copy-of, is dropped in Saxon 9.6.
The rules for whether a construct is streamable or not are largely the same in Saxon as in the XSLT 3.0 specification. Saxon applies these rules after doing any optimization re-writes, so some constructs end up being streamable in Saxon even though they are not guaranteed streamable in the W3C spec, because the Saxon optimizer rewrites the expression into a streamable form. An example of this effect is where variables or functions are inlined before doing the streamability analysis. In contrast, when streaming is requested, the optimizer takes care to avoid rewriting streamable constructs into a non-streamable form.
This documentation does not attempt to provide a tutorial introduction to the streaming capabilities of XSLT 3.0. The specification itself is not easy to read, especially the detailed rules on which constructs are deemed streamable. However, for the most part it is not necessary to be familiar with the detailed rules. The main things to remember are:
- A construct is "consuming" if it reads a subtree of the source document, that is, if it makes a downwards selection from the context item. In general, constructs are not allowed to have two operands that are both consuming. Some exceptions to this are: the xsl:fork instruction; conditional expressions such as xsl:choose if each branch only contains one consuming expression; the map expression map{...} in XPath and the xsl:map instruction in XSLT.
- During a streaming pass, the XSLT processor remembers the ancestors of the context item and all the attributes of ancestors. Path expressions that access the ancestors and their attributes are therefore allowed. However, such expressions should generally return atomic values (for example the values of attributes) rather than returning nodes in the streamed document, because if nodes are returned, the system often can't be sure that there is no disallowed navigation from those nodes (for example, you can't get all the descendants of an ancestor node).
- It's not permitted to bind a streamed node to a variable or parameter, or to pass it to a function.
- An expression such as //section is referred to as a crawling expression. Crawling expressions potentially contain nodes which overlap each other, which creates problems if you want to make further downward selections from such nodes. The XSLT 3.0 specification allows this in some circumstances, for example you can pass such an expression to a function that atomizes the result, but other cases (for example, using such an expression in xsl:for-each or xsl:apply-templates) are forbidden. If you know that the expression will never select overlapping nodes (for example, if you know that //title will never select one title appearing within another title), then you can rewrite the expression as outermost(//title) to avoid the restrictions. Saxon also allows overlapping nodes in some contexts where the W3C specification does not, provided streamability extensions are enabled.
- When you hit these restrictions, you can often work around them by making a copy of a subtree of the streamed document, for example by using the new copy-of() or snapshot() functions. These are consuming expressions, but the result is "grounded" (that is, an ordinary in-memory tree) so it can be used without any restrictions. Clearly this only works if the subtrees that you copy are small enough to fit in memory.
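The two workarounds mentioned in the list above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the template names, the file names employees.xml and book.xml, and the element names employee, name, salary and title are all hypothetical.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed input shape: an employees element containing employee
       elements, each with name and salary children. -->
  <xsl:template name="high-earners">
    <xsl:stream href="employees.xml">
      <high-earners>
        <!-- copy-of(.) grounds each employee as a small in-memory tree,
             so the body may make several downward selections from it
             (here, both salary and name). -->
        <xsl:for-each select="employees/employee/copy-of(.)">
          <xsl:if test="number(salary) gt 50000">
            <name><xsl:value-of select="name"/></name>
          </xsl:if>
        </xsl:for-each>
      </high-earners>
    </xsl:stream>
  </xsl:template>

  <xsl:template name="titles">
    <xsl:stream href="book.xml">
      <titles>
        <!-- outermost() asserts that the selected title elements do not
             overlap, so the crawling expression //title becomes usable
             in xsl:for-each. -->
        <xsl:for-each select="outermost(//title)">
          <title><xsl:value-of select="."/></title>
        </xsl:for-each>
      </titles>
    </xsl:stream>
  </xsl:template>

</xsl:stylesheet>
```

In the first template only one employee subtree needs to be held in memory at a time, which is why the approach depends on each copied subtree being small.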
The XSLT 3.0 constructs most relevant to streaming are:
- Streamable template rules. XSLT 3.0 has a new xsl:mode declaration, and this allows all the template rules in a particular mode to be declared streamable (<xsl:mode streamable="yes"/>). If a mode is declared streamable, then Saxon checks whether all the template rules in that mode are actually streamable, and reports a compile-time error if not.
- The xsl:stream instruction. This has an href attribute which defines the URI of a streamed input document, and the instructions within xsl:stream are evaluated with this document as the context node. The body of the xsl:stream instruction must satisfy the streamability rules; again, any violation is detected at compile time.
- The xsl:iterate instruction. This is like an xsl:for-each instruction except that it guarantees to process the selected nodes in order, and the results of processing one node can be passed as a parameter to the next iteration, so the action applied to one node can influence the way in which subsequent nodes are processed. This often provides a solution to the problem that when streaming, you can never "look backwards" at preceding nodes. Instead of looking backwards, the information that will be needed when processing subsequent nodes can be retained in parameters and "passed forwards". Note that streamed nodes themselves cannot be contained in parameters, but data derived from those nodes can.
- The xsl:merge instruction allows several input sequences to be merged, based on the value of a sort key. Any or all of the input sequences can be streamed documents, provided that they are already correctly sorted on the sort key value.
- Accumulators allow values to be computed "in the background" while a streamed document is being read; the final value of the accumulator is available by calling the accumulator-after() function at the end of processing, and intermediate values are also available. Accumulators are useful if you want to compute several values during a single processing pass of a streamed document (for example, a minimum and maximum of some value). When the information to be maintained in the accumulator is complex, it can be useful to hold it in a map, which is a new data structure introduced in XSLT 3.0.
- The xsl:fork instruction effectively computes several instructions in parallel. In the Saxon implementation, they are not actually evaluated in different threads, but they are all executed during a single scan of the streamed input document. The outputs produced by each "prong" of the xsl:fork instruction are buffered in memory until all prongs have completed, and are then assembled in the correct order to form the final result.
- Streamed grouping is possible using the xsl:for-each-group instruction, provided that one of the options group-adjacent, group-starting-with, or group-ending-with is used. There are restrictions on the use of the current-group() function within such an instruction: essentially, it can only be used once, because it is a consuming construct.
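To make the "pass forwards" idea behind xsl:iterate more concrete, here is a minimal sketch that computes a running balance over a streamed document. The file name transactions.xml and the date and value attributes on each transaction element are assumptions made for the example. Only an atomic value (the balance) is carried from one iteration to the next, never a streamed node.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:template name="main">
    <xsl:stream href="transactions.xml">
      <balances>
        <xsl:iterate select="transactions/transaction">
          <!-- Initial value for the first iteration. -->
          <xsl:param name="balance" select="0.00" as="xs:decimal"/>
          <!-- Derived from an attribute, so this is an atomic value
               rather than a streamed node. -->
          <xsl:variable name="newBalance" as="xs:decimal"
                        select="$balance + xs:decimal(@value)"/>
          <balance date="{@date}" value="{$newBalance}"/>
          <!-- Hand the updated balance to the next transaction. -->
          <xsl:next-iteration>
            <xsl:with-param name="balance" select="$newBalance"/>
          </xsl:next-iteration>
        </xsl:iterate>
      </balances>
    </xsl:stream>
  </xsl:template>

</xsl:stylesheet>
```

The body reads only attributes of the current transaction, so it makes no downward selection and never needs to look back at earlier nodes.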
All these facilities are available in Saxon-EE only. Streamed templates and accumulators also require XSLT 3.0 to be enabled by setting the relevant configuration parameters or command line options.