XML parsing in SaxonCS
XML parsing in SaxonCS is delegated to the System.Xml parser supplied with the .NET platform.
The Microsoft parser has some limitations:
It does not notify ID or IDREF values from the DTD, or expand attributes with fixed or default values, unless DTD validation is requested. This can be requested using the -v option on the command line, or via the API. (See the DtdValidation property of the DocumentBuilder class.)
It rejects documents that specify <?xml version="1.1"?>, or that use namespace undeclarations in the form xmlns:p="".
It does not notify unparsed entities to Saxon. The XSLT functions unparsed-entity-uri() and unparsed-entity-public-id() will therefore not work.
It does not notify changes of base URI (for example, at entity boundaries). In principle, Saxon could interrogate the parser to determine the base URI of each element as it is delivered. Currently this is not done, so in the absence of xml:base attributes, all elements in a document will have the same base URI, regardless of external entity boundaries.
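As an illustration of the first limitation, the following is a minimal sketch of requesting DTD validation through the API. It relies on the DtdValidation property of DocumentBuilder mentioned above. The input location is hypothetical, and the exact member names should be checked against the Saxon.Api documentation for your SaxonCS release.

```csharp
using System;
using Saxon.Api;

// Minimal sketch: request DTD validation so that the System.Xml parser
// reports ID/IDREF values and expands fixed and default attribute values.
var processor = new Processor();
DocumentBuilder builder = processor.NewDocumentBuilder();

// DtdValidation is the DocumentBuilder property referred to in the text.
builder.DtdValidation = true;

// Hypothetical input file; it is assumed to carry a DOCTYPE declaration.
XdmNode doc = builder.Build(new Uri("file:///tmp/input.xml"));
Console.WriteLine(doc.BaseUri);
```

With validation switched on, a document that does not conform to its DTD is expected to fail to build, so this setting is only suitable when the input is known to be valid.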
There are several ways the System.Xml parser can be used:
In interfaces where a source XML document is provided by supplying a URI, or a Stream, or a TextReader, Saxon will invoke the Microsoft parser to parse the content. In this case the parser operates as a stream-based (pull-mode) parser, notifying parsing events to Saxon as they occur. Saxon may build a tree representation of the document internally, or it may process the data in streamed mode. This is generally more efficient than supplying a DOM.
Some interfaces also allow you to supply input in the form of an XmlReader. This allows you to control the settings and options applied to the XmlReader. Some settings may produce behavior that is not conformant with the W3C specifications (for example, switching character checking off), and which could potentially cause Saxon to behave unpredictably.
You can also use the System.Xml parser to construct an in-memory DOM tree (represented by an XmlDocument or XmlNode). Some Saxon interfaces accept input in the form of an XmlDocument or XmlNode. There are two ways Saxon can handle a DOM:
The DOM can be copied to an internal Saxon tree structure.
Note that there's no point constructing a DOM just so Saxon can rebuild it: it's better to let Saxon parse the XML and build its own tree. However, this option is useful if you are using a DOM for other reasons, for example if your application has other parts that are DOM-based.
The DOM can be wrapped as a Saxon tree. This avoids the cost and memory overhead of copying the tree, but the result is slower to navigate.
(SaxonCS does not currently offer the Domino model, which is a hybrid between copying and wrapping: Domino on SaxonJ uses the DOM as supplied, but adds indexes for fast searching.)
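The input routes described above can be sketched as follows. This is an illustrative outline only, not a definitive usage pattern: the file name input.xml is hypothetical, and the Build and Wrap method names on DocumentBuilder are assumed to match the Saxon.Api documentation for your SaxonCS release.

```csharp
using System;
using System.Xml;
using Saxon.Api;

var processor = new Processor();
DocumentBuilder builder = processor.NewDocumentBuilder();

// Route 1: supply a pre-configured XmlReader; Saxon consumes its events.
var settings = new XmlReaderSettings
{
    DtdProcessing = DtdProcessing.Parse, // allow a DOCTYPE declaration
    CheckCharacters = true               // leave character checking on for conformance
};
using (XmlReader reader = XmlReader.Create("input.xml", settings))
{
    XdmNode fromReader = builder.Build(reader);
    Console.WriteLine(fromReader.BaseUri);
}

// Route 2: start from a DOM that the application already holds.
var dom = new XmlDocument();
dom.Load("input.xml");

// Copy the DOM into an internal Saxon tree (costs time and memory up front).
XdmNode copied = builder.Build(dom);

// Wrap the DOM in place (no copy, but slower to navigate).
XdmNode wrapped = builder.Wrap(dom);
```

If no DOM exists yet, the first route (or simply passing a URI) avoids building the document twice.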
To use URI catalogs and locally-cached documents with the Microsoft parser, download the NuGet package Org.XmlResolver and nominate this as your XmlResolver before invoking the parser.
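One way to nominate a resolver is to attach it to the XmlReaderSettings used to create the reader that is handed to Saxon. The sketch below assumes that the resolver object obtained from the Org.XmlResolver package derives from System.Xml.XmlResolver. How that object is constructed and configured with catalogs is deliberately left out, and the class and method names ResolverExample and BuildWithResolver are hypothetical.

```csharp
using System;
using System.Xml;
using Saxon.Api;

static class ResolverExample
{
    // 'resolver' is any System.Xml.XmlResolver, for example a catalog-aware
    // resolver created with the Org.XmlResolver package (construction not shown).
    public static XdmNode BuildWithResolver(
        Processor processor, string inputUri, XmlResolver resolver)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // needed for DTDs and external entities
            XmlResolver = resolver               // consulted for external resources
        };

        // The reader is disposed when the method returns.
        using XmlReader reader = XmlReader.Create(inputUri, settings);
        return processor.NewDocumentBuilder().Build(reader);
    }
}
```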
No other XML parser is currently supported, although it is in principle possible to plug in a third-party parser provided it is capable of delivering a Saxon XdmNode or a DOM XmlNode.
HTML parsing
Parsing of HTML5 documents can be achieved by calling either of the functions saxon:parse-html() or (if 4.0 extensions are enabled) fn:parse-html(). Both forms are identical. The function has changed substantially between SaxonCS 11 and SaxonCS 12; whereas SaxonCS 11 used HtmlAgilityPack as the underlying parser, SaxonCS 12 uses AngleSharp, which conforms much more closely to the standard HTML5 parsing algorithm.
Saxon parses the supplied HTML content using AngleSharp, returning a Saxon wrapper around the resulting AngleSharp.Dom.IDocument node, from which XPath navigation is possible.
Normal HTML elements (such as <p>, <div>, etc.) are delivered as element nodes whose local name is in lower case, with the XHTML namespace URI.
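The following is a minimal sketch of calling saxon:parse-html() from the .NET API and navigating the result with XPath. It assumes the usual Saxon extension namespace http://saxon.sf.net/ bound to the prefix saxon. The prefix h for the XHTML namespace and the sample markup are chosen purely for illustration.

```csharp
using System;
using Saxon.Api;

var processor = new Processor();
XPathCompiler xpath = processor.NewXPathCompiler();

// Bind the Saxon extension namespace and the XHTML namespace.
xpath.DeclareNamespace("saxon", "http://saxon.sf.net/");
xpath.DeclareNamespace("h", "http://www.w3.org/1999/xhtml");

// The second paragraph is written in upper case in the source markup;
// the name test h:p relies on local names being delivered in lower case.
XdmValue count = xpath.Evaluate(
    "count(saxon:parse-html('<p>Hello</p><P>World</P>')//h:p)", null);

Console.WriteLine(count);
```

Because the elements are in the XHTML namespace, an unprefixed name test such as //p would not match them unless a default element namespace is declared.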