Handling of source documents
When the source document is supplied as a pre-built tree (in any format), and Saxon strips whitespace
text nodes as requested by the stylesheet, the space stripping now takes account of any xml:space
attributes present in the tree. Specifically, whitespace text nodes are preserved if xml:space="preserve"
is specified. This can be expensive, but is required for conformance. When supplying pre-built trees
as input (whether as DOM, JDOM, or XOM trees, or as native Saxon trees) it is best not to use
xsl:strip-space in the stylesheet.
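The stripping rule described above can be illustrated with a small sketch that uses only the JAXP DOM API. It is not Saxon's implementation: the class XmlSpaceSketch and its methods are hypothetical, and it assumes the equivalent of xsl:strip-space elements="*", so every whitespace-only text node is a candidate for removal unless xml:space="preserve" is in effect.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is not Saxon code.
class XmlSpaceSketch {

    /** True if the nearest enclosing xml:space attribute has the value "preserve". */
    static boolean isSpacePreserved(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:space")) {
                    // The innermost declaration wins, whether "preserve" or "default".
                    return "preserve".equals(e.getAttribute("xml:space"));
                }
            }
        }
        return false;
    }

    /** Removes whitespace-only text nodes unless xml:space="preserve" applies to them. */
    static void stripWhitespace(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before any removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()
                    && !isSpacePreserved(child)) {
                parent.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc><a> </a><b xml:space=\"preserve\"> </b></doc>");
        stripWhitespace(doc.getDocumentElement());
        // The whitespace inside <a> is removed; the whitespace inside <b> is kept.
        System.out.println("a children: "
                + doc.getElementsByTagName("a").item(0).getChildNodes().getLength());
        System.out.println("b children: "
                + doc.getElementsByTagName("b").item(0).getChildNodes().getLength());
    }
}
```

The ancestor walk performed for each whitespace text node gives a rough idea of why this check adds cost when applied to a tree that has already been built.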
When the source document is supplied as a DOM or JDOM tree, multiple adjacent text and CDATA nodes are now mapped to a single text node in the XPath model. If the XPath text node is passed to a Java extension function, the extension function sees the first node in the underlying sequence. This change has not yet been made for XOM trees.
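As an illustration of the situation described above, the following sketch uses only the JAXP DOM API to show how one run of character data can be held as several adjacent DOM nodes, and what the single merged string value looks like. The class AdjacentTextSketch and the helper mergedValue are hypothetical names; the sketch does not show how Saxon performs the mapping internally.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is not Saxon code.
class AdjacentTextSketch {

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(false); // keep CDATA sections as separate DOM nodes
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Concatenates the run of adjacent text and CDATA siblings starting at 'first'. */
    static String mergedValue(Node first) {
        StringBuilder sb = new StringBuilder();
        for (Node n = first;
             n != null && (n.getNodeType() == Node.TEXT_NODE
                        || n.getNodeType() == Node.CDATA_SECTION_NODE);
             n = n.getNextSibling()) {
            sb.append(n.getNodeValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p>one <![CDATA[<two>]]> three</p>");
        Node p = doc.getDocumentElement();
        // Number of DOM child nodes that make up the character content of <p>.
        System.out.println("DOM children: " + p.getChildNodes().getLength());
        // Value of the first underlying node, which is what an extension function would see.
        System.out.println("first node:   [" + p.getFirstChild().getNodeValue() + "]");
        // String value of the single text node in the XPath view.
        System.out.println("merged value: [" + mergedValue(p.getFirstChild()) + "]");
    }
}
```

An extension function that needs the full text should therefore walk the following siblings of the node it receives, rather than rely on that node's value alone.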
Saxon accepts URIs of the form "document.xml#id" where "id" is the value of an attribute defined in the DTD
as being of type ID. It now also accepts such URIs where the fragment identifier is the value of an xml:id
attribute.
Where a stylesheet is embedded in a source document, or a schema is embedded within a stylesheet, the base URI of the embedded document was previously taken as being the same as the base URI of the containing document. It is now taken as the base URI of the relevant element. This means that the xml:base attribute is taken into account.
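The effect of this change on relative URI resolution can be sketched with java.net.URI. The URIs, the class XmlBaseSketch, and the helper elementBase are invented for the illustration; the sketch models only the resolution arithmetic, not Saxon's own code.

```java
import java.net.URI;

// Hypothetical illustration only; this is not Saxon code.
class XmlBaseSketch {

    /** Base URI of an element: its xml:base (if any) resolved against the container's base URI. */
    static URI elementBase(URI containerBase, String xmlBase) {
        return xmlBase == null ? containerBase : containerBase.resolve(xmlBase);
    }

    public static void main(String[] args) {
        URI containerBase = URI.create("http://example.com/docs/main.xml");

        // Previous behaviour: the embedded stylesheet inherits the container's base URI.
        System.out.println(containerBase.resolve("common.xsl"));
        // prints http://example.com/docs/common.xsl

        // New behaviour: an xml:base="styles/" attribute on the embedded element is honoured.
        URI embeddedBase = elementBase(containerBase, "styles/");
        System.out.println(embeddedBase.resolve("common.xsl"));
        // prints http://example.com/docs/styles/common.xsl
    }
}
```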
In a previous release, following a change in the W3C specifications, Saxon was changed so
that DTD-based types such as ID and IDREF did not set the type annotation on the attribute node.
An unintended consequence of this change was that the idref() function stopped working when an
attribute was defined in the DTD as being of type IDREF or IDREFS. This has now been fixed. Doing so
required some changes to the data model. The is-id and is-idref properties defined in the W3C data model
are not reflected directly in the Saxon implementation, but the information is now available in a slightly
different way. The method getTypeAnnotation() when applied to an attribute node may now return a value
that contains the fingerprint code for the type xs:ID, xs:IDREF, or xs:IDREFS together with a high bit
(NodeInfo.IS_DTD_TYPE) indicating that the type is DTD-derived rather than schema-derived. When this bit
is set, the value should be treated as being untyped atomic, but the type annotation returned indicates
whether the is-id or is-idref properties are present. This same change applies to the type code passed
with attributes in the Receiver and PullProvider interfaces.
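The way a caller might decode such a type annotation can be sketched as follows. This is a self-contained illustration under stated assumptions: the numeric value given to IS_DTD_TYPE and the three fingerprint constants are placeholders chosen for the example, not the values used by Saxon, and the class and method names are hypothetical. Real code would use NodeInfo.IS_DTD_TYPE and the fingerprints defined by the Saxon release in use.

```java
// Hypothetical illustration only; all constant values below are placeholders.
class TypeAnnotationSketch {

    static final int IS_DTD_TYPE = 1 << 30; // assumed position of the "DTD-derived" high bit
    static final int FP_ID = 560;           // placeholder fingerprint for xs:ID
    static final int FP_IDREF = 561;        // placeholder fingerprint for xs:IDREF
    static final int FP_IDREFS = 562;       // placeholder fingerprint for xs:IDREFS

    /** True if the annotation came from the DTD, so the value is treated as untyped atomic. */
    static boolean isDtdDerived(int annotation) {
        return (annotation & IS_DTD_TYPE) != 0;
    }

    /** The type fingerprint with the DTD flag masked out. */
    static int fingerprint(int annotation) {
        return annotation & ~IS_DTD_TYPE;
    }

    /** Corresponds to the is-id property of the W3C data model. */
    static boolean isId(int annotation) {
        return fingerprint(annotation) == FP_ID;
    }

    /** Corresponds to the is-idref property of the W3C data model. */
    static boolean isIdref(int annotation) {
        int fp = fingerprint(annotation);
        return fp == FP_IDREF || fp == FP_IDREFS;
    }

    public static void main(String[] args) {
        int annotation = FP_IDREFS | IS_DTD_TYPE; // an attribute declared IDREFS in the DTD
        System.out.println("DTD-derived: " + isDtdDerived(annotation));
        System.out.println("is-id:       " + isId(annotation));
        System.out.println("is-idref:    " + isIdref(annotation));
    }
}
```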
PTree files
Saxon-SA 8.5 allows an XML document to be saved on disk in a format referred to as a PTree. This is a binary format designed for speed of loading. A document in PTree format takes about the same amount of disk space as the original source XML, but takes about half as long to load into memory. The saving is greater when the document contains type information, because this is retained in the PTree without the need to revalidate.
Two new commands are available, com.saxonica.ptree.PTreeWriter and com.saxonica.ptree.PTreeReader,
to convert XML documents into PTrees and vice versa.
A PTree can be supplied as the input to a transformation or query using the class PTreeSource,
which implements the JAXP Source interface.
A new command-line option is available on the commands com.saxonica.Transform and com.saxonica.Query.
The option -p causes a URIResolver to be used that recognizes the file extension .ptree as representing
a Saxon PTree. This option implicitly switches on the -u option, meaning that the source file name is
interpreted as a URI. The PTreeURIResolver, as well as recognising the .ptree file extension, also
recognizes query parameters at the end of a URI. In particular it recognizes the parameters
validation=strict, validation=lax, and validation=strip, which control how a source document is
schema-validated. For example, doc('source.xml?validation=lax') loads a source document with lax
validation. This option allows different validation to be applied to different source documents loaded
by a single query or transformation.
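To make the URI conventions above concrete, the following sketch shows one way a resolver could pick out the .ptree extension and the validation parameter from a URI. It uses only java.net.URI and is not the PTreeURIResolver itself: the class QueryParamSketch, its methods, and the choice of "&" as the separator between parameters are assumptions made for the illustration.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the Saxon PTreeURIResolver.
class QueryParamSketch {

    /** Splits the query part of a URI into name=value pairs (separator "&" is an assumption). */
    static Map<String, String> parseQuery(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(uri).getQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** True if the path part of the URI, ignoring any query, ends in ".ptree". */
    static boolean isPTree(String uri) {
        String path = URI.create(uri).getPath();
        return path != null && path.endsWith(".ptree");
    }

    public static void main(String[] args) {
        System.out.println(parseQuery("source.xml?validation=lax").get("validation"));
        // prints lax
        System.out.println(isPTree("data/books.ptree?validation=strip"));
        // prints true
        System.out.println(isPTree("source.xml"));
        // prints false
    }
}
```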
The result of a query or transformation can be serialized as a PTree by specifying saxon:ptree
as the serialization method. From the command line, use the parameter
!method={http://saxon.sf.net/}ptree.
The PTree format has been designed so that one Saxon release should normally be able to read PTree files created by an earlier release. It may not always be possible, however, to read PTrees created using a later Saxon release. The PTree is not dependent on any particular NamePool, and can be freely moved between different machines just as source XML can. It is a binary format, so there is no dependency on any particular character encoding or machine architecture. PTree files are not designed to be read or written directly by user applications, nor are they designed to provide an interchange format between Saxon and other products: the internal format is therefore not published.
When a PTree contains type information, the schema that defines those types must also be loaded. This doesn't happen automatically. At present, there is no way of storing a compiled schema on disk, so this will generally involve rebuilding the schema from its source representation. It is the user's responsibility to ensure that the loaded schema is consistent with the schema that was used to validate the original XML document.
For more information see PTree Files.