The PTree File Format
Saxon-PE and Saxon-EE support a file format called the PTree (persistent tree). This is a binary representation of an XML document. The PTree file is generally about the same size as the original document (perhaps 10% smaller), but it typically loads in about half the time. Storing a document as a PTree can therefore give a useful performance improvement when the same source document is used repeatedly as the input to many queries or transformations. Another benefit of the PTree is that it retains any type information that is present, which means that the document does not need to be validated against its schema each time it is loaded. (The schema, however, must be loaded whenever the document is loaded.)
Two commands are available for converting XML documents into PTree files and vice versa. To create a PTree, use:
java com.saxonica.ptree.PTreeWriter source.xml result.ptree
The option -strip causes all whitespace-only text nodes to be stripped in the process, which will often give a useful saving in space and therefore in loading time.
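To make the effect of whitespace stripping concrete, here is a minimal sketch that removes whitespace-only text nodes from a tree. It uses the JDK's DOM API rather than Saxon, so it only illustrates the idea and is not how PTreeWriter is implemented. The class and method names (WhitespaceStripSketch, stripWhitespaceOnlyText, countTextNodes) are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: plain JDK DOM, not Saxon. All names here are hypothetical.
final class WhitespaceStripSketch {

    // Parses an XML string into a DOM document (checked exceptions are wrapped).
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Removes every text node beneath 'node' that contains only whitespace.
    static void stripWhitespaceOnlyText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceOnlyText(child);
            }
            child = next;
        }
    }

    // Counts the text nodes beneath 'node'.
    static int countTextNodes(Node node) {
        int count = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += (c.getNodeType() == Node.TEXT_NODE) ? 1 : countTextNodes(c);
        }
        return count;
    }

    public static void main(String[] args) {
        // Indented markup: the line breaks and indentation are whitespace-only text nodes.
        Document doc = parse("<a>\n  <b>x</b>\n  <c> y </c>\n</a>");
        System.out.println("text nodes before: " + countTextNodes(doc));
        stripWhitespaceOnlyText(doc);
        System.out.println("text nodes after:  " + countTextNodes(doc));
    }
}
```

In indented documents, most of the text nodes are indentation of this kind, which is why dropping them can noticeably reduce the size of the stored tree. Text such as " y " is kept intact because it is not whitespace-only.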
To convert a PTree back to an XML document, use:
java com.saxonica.ptree.PTreeReader source.ptree result.xml
It is possible to apply a query or transformation directly to a PTree by specifying the -p option on the command line for com.saxonica.Transform or com.saxonica.Query.
This option actually causes a different URIResolver, the PTreeURIResolver, to be used in place of the standard URIResolver. The PTreeURIResolver recognizes any URI ending in the extension .ptree as an identifier for a file in PTree format. This extends to files loaded using the doc() or document() functions: if the file extension is .ptree, the file will be assumed to be in PTree format.
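The extension-based dispatch described above can be sketched with the standard JAXP URIResolver interface. This is not the source of PTreeURIResolver. It is a hypothetical wrapper (ExtensionDispatchResolver) that resolves the requested URI against the base URI and then hands it to one of two delegate resolvers depending on whether it ends in .ptree. The delegates, and the choice of a case-sensitive match, are assumptions made for illustration.

```java
import java.net.URI;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: routes URIs ending in ".ptree" to a separate loader.
final class ExtensionDispatchResolver implements URIResolver {
    private final URIResolver ptreeLoader; // assumed to know how to read PTree files
    private final URIResolver xmlLoader;   // assumed to parse ordinary XML

    ExtensionDispatchResolver(URIResolver ptreeLoader, URIResolver xmlLoader) {
        this.ptreeLoader = ptreeLoader;
        this.xmlLoader = xmlLoader;
    }

    // Resolves 'href' against 'base'; returns 'href' unchanged if there is no base.
    static String absolutize(String href, String base) {
        if (base == null || base.isEmpty()) {
            return href;
        }
        return URI.create(base).resolve(href).toString();
    }

    // Assumption: the match is a simple, case-sensitive suffix test.
    static boolean hasPTreeExtension(String uri) {
        return uri.endsWith(".ptree");
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        String absolute = absolutize(href, base);
        URIResolver target = hasPTreeExtension(absolute) ? ptreeLoader : xmlLoader;
        return target.resolve(absolute, null);
    }

    public static void main(String[] args) throws TransformerException {
        // Stand-in loaders that just report which branch was taken.
        URIResolver ptree = (href, base) -> new StreamSource("ptree-loader:" + href);
        URIResolver xml = (href, base) -> new StreamSource("xml-loader:" + href);
        URIResolver resolver = new ExtensionDispatchResolver(ptree, xml);

        System.out.println(resolver.resolve("lookup.ptree", "file:/data/main.xml").getSystemId());
        System.out.println(resolver.resolve("lookup.xml", "file:/data/main.xml").getSystemId());
    }
}
```

Because the decision is made purely on the URI, the same rule applies whether the document is named on the command line or requested through doc() or document(), which matches the behaviour described above.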
The result of a query or transformation can be serialized as a PTree file by specifying saxon:ptree as the output method, where the namespace prefix saxon represents the URI http://saxon.sf.net/.
The PTree format is designed to allow future Saxon releases to read files created using older releases. The converse may not always be true: it might sometimes be impossible for release N to read a PTree file created using release N+1.
In releases up to and including Saxon 9.3, PTree files were always at version 0. Saxon 9.4 introduces a new version, version 1. The new version differs in retaining DTD-derived attribute types (ID, IDREF, IDREFS). The PTreeReader in Saxon 9.4 (onwards) will read both versions. The PTreeWriter in Saxon 9.4 writes version 1 output by default (which cannot be read by earlier releases), but it can still write the original version 0 output if requested. If called from the command line, use the option -version:0.
The PTree format does not retain the base URI of the original file: when a PTree is loaded, the base URI is taken as the URI of that file, not the original XML file. The PTree is a serialization of the XPath data model, so information that isn't present in the data model will not be present in the PTree: for example, it will have no DTD and no entity references or CDATA sections.
References to unparsed entities are not currently retained in a PTree.
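The practical consequence of the base URI rule is that relative references are resolved against the location of the PTree file, not the location of the XML file it was created from. The following sketch shows the difference using java.net.URI from the JDK. The file locations and the class name BaseUriSketch are invented for illustration, and the code does not involve Saxon itself.

```java
import java.net.URI;

// Illustration only: shows how the same relative reference resolves
// differently depending on which file supplies the base URI.
final class BaseUriSketch {

    // Resolves a relative reference against a base URI.
    static URI resolveAgainst(String baseUri, String relativeRef) {
        return URI.create(baseUri).resolve(relativeRef);
    }

    public static void main(String[] args) {
        String relative = "images/logo.png";

        // Base URI of the original XML file (hypothetical location).
        System.out.println(resolveAgainst("file:/data/source/doc.xml", relative));
        // prints file:/data/source/images/logo.png

        // Base URI after loading the PTree from a different directory.
        System.out.println(resolveAgainst("file:/data/cache/doc.ptree", relative));
        // prints file:/data/cache/images/logo.png
    }
}
```

If a stylesheet or query relies on relative URIs within the document, keeping the PTree file in the same directory as the original XML is one way to keep such references resolving to the same resources.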