Using XPath
This section describes how to use Saxon as a class library for XPath processing from Java or C#, without making any use of XSLT stylesheets. It includes information on the XPath API, and the API for the Saxon implementation of the XPath object model. On other pages you will find the API for XSLT transformation, the API for running XQuery, and the API for schema validation.
For information about the different ways of loading source documents, see Handling XML documents.
Saxon supports two public Java APIs for XPath processing, as follows:
- The preferred interface for XPath processing is Saxon's s9api interface ("snappy"), which also supports XSLT and XQuery processing, schema validation, and other Saxon functionality in an integrated set of interfaces. This is described at Evaluating XPath expressions using s9api.
- The JAXP API is a (supposedly) standard API defined in Java 5. Saxon implements this interface. Details of Saxon's implementation are described at JAXP XPath API. Note that there are some extensions and other variations in the Saxon implementation. Some of the extensions to this interface are provided because Saxon supports XPath 2.0 (and higher), whereas JAXP 1.3 is designed primarily for XPath 1.0; some are provided because Saxon supports multiple object models, not only DOM; some are for backwards compatibility; and some are provided to allow applications a finer level of control if required.
- On .NET, XPath processing is available via classes in the Saxon.Api namespace, as described at Evaluating XPath expressions from C#.
Saxon allows XPath expressions to be evaluated either against its own native tree models of XML (the tiny tree and linked tree), or against trees built using external third-party libraries:
- With SaxonJ, the supported tree models are DOM, JDOM, JDOM2, DOM4J, XOM, or AXIOM.
- With SaxonCS, the DOM implementation provided by the System.Xml.XmlDocument class is supported.
Note that using a third-party tree implementation may have a significant performance overhead compared with Saxon's native tree models; furthermore, most of these implementations are not thread-safe, which can cause problems when Saxon's multithreading capability comes into play.
Namespaces
When XPath expressions use prefixed element or attribute names (such as p:foo//p:bar) it is necessary to declare the prefix (here p) so that the XPath processor knows which namespace it refers to. It's irrelevant what prefix is used in the source document: the vital thing is that the prefix used in the XPath expression maps to the same namespace as the prefix (or defaulted prefix) used in the source document.
All the APIs therefore provide a mechanism for binding prefixes to namespaces. Because Saxon supports XPath 3.1, it's also possible to use alternative mechanisms to refer to namespaced elements:
- To match elements regardless of what namespace they are in, you can write *:foo//*:bar.
- To match elements in namespace http://foobar.com/ without declaring a prefix, you can use the notation Q{http://foobar.com/}foo//Q{http://foobar.com/}bar.
In XPath 1.0, an unprefixed name in an XPath expression was defined to match no-namespace elements in the source document. If the source document uses a namespace (even if it's the default namespace, used with no prefix), then the corresponding names in the XPath expression need to be prefixed.
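To illustrate prefix binding and the XPath 1.0 rule for unprefixed names, here is a minimal sketch using the standard JAXP interfaces (javax.xml.xpath and javax.xml.namespace.NamespaceContext). The class name NamespaceBindingDemo, the bind helper, and the sample document are hypothetical, invented for this example. The sketch assumes that XPathFactory.newInstance() returns the JDK's built-in XPath 1.0 engine; the Saxon-specific ways of obtaining an XPath processor are covered on the API pages referenced earlier.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceBindingDemo {

    // Hypothetical sample document: both elements are in a default namespace.
    static final String XML =
        "<foo xmlns='http://foobar.com/'><bar>hello</bar></foo>";

    // Minimal NamespaceContext binding exactly one prefix to one namespace URI.
    // (A production implementation should also handle the reserved prefixes.)
    static NamespaceContext bind(String prefix, String uri) {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String p) {
                return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String u) {
                return uri.equals(u) ? prefix : null;
            }
            @Override public Iterator<String> getPrefixes(String u) {
                return uri.equals(u)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
            }
        };
    }

    // Evaluates the expression against the sample document and returns
    // the string value of the result.
    static String evaluate(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for namespace-sensitive matching
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The prefix "p" is our own choice; the source document uses no prefix.
            xpath.setNamespaceContext(bind("p", "http://foobar.com/"));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Prefixed names resolve to the bound namespace and match.
        System.out.println("[" + evaluate("/p:foo/p:bar") + "]");
        // Unprefixed names mean "no namespace" in XPath 1.0, so nothing matches.
        System.out.println("[" + evaluate("/foo/bar") + "]");
    }
}
```

The first call selects the bar element even though the source document declares no prefix at all, because only the namespace URI matters. The second call yields an empty string, since the unprefixed names look for no-namespace elements that the document does not contain.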
XPath 2.0 introduced the idea of a default namespace for elements and types, allowing you to declare that unprefixed element names in the XPath expression refer to some specific namespace in the source document. This isn't supported in the JAXP API (which was never updated to handle XPath 2.0), but it's supported in the Saxon APIs: binding the zero-length prefix to a namespace URI has the effect of making that the default namespace for element and type names.
Saxon (from 11.0) goes a step further and allows you to declare an unprefixed element matching policy that determines how unprefixed element names are handled. The possible values are:
- DefaultNamespace: this corresponds to the XPath 2.0 situation, where an unprefixed element name matches a specific default namespace, or no-namespace elements if no default has been declared.
- AnyNamespace: with this policy, writing foo//bar is equivalent to writing *:foo//*:bar: names are matched by local name alone, ignoring the namespace.
- DefaultNamespaceOrNone: with this policy, an unprefixed name such as foo matches an element with local name foo if it either (a) is in no namespace, or (b) is in the declared default namespace. This policy is expressly intended for processing (X)HTML documents, where it can be rather unpredictable (depending on the parser) whether elements are treated as being in no namespace or as being in the XHTML namespace. By using this policy, with the XHTML namespace as the default, you can use HTML element names such as div without worrying too much about what namespace they are in.
XPath versions
W3C has published four versions of XPath: 1.0, 2.0, 3.0, and 3.1. In addition, there is a W3C Community Group, led by Saxonica, working on ideas for a version 4.0, some of which are implemented experimentally in Saxon 11.
Saxon claims 100% conformance with XPath 3.1.
The XSLT 3.0 specification gives implementations the option of supporting XPath 3.0 with selected features from XPath 3.1, or full XPath 3.1. The Saxon XPath processor only offers 3.1 (it offers neither pure XPath 3.0, nor XPath 3.0 plus these selected extensions).
The XSD 1.1 specification requires conformance with XPath 2.0, and the Saxon XPath engine therefore offers a mode of operation that aims at 2.0 conformance. This is primarily a question of blocking constructs that only became available in later versions. Generally (in this mode) Saxon achieves conformance with the 2.0 specifications where the differences can be detected statically, but it does not attempt to reproduce differences of run-time behavior. For example, the string-join() function in 3.1 is more flexible than 2.0 in what arguments it will accept, and Saxon implements the 3.1 rules in all cases.
XPath 2.0, 3.0, and 3.1 have all defined a backwards compatibility mode where some of the behavior defined in XPath 1.0 is retained. For example, in this mode strings compared using operators such as < and > are compared as numbers, so the string "10" is greater than the string "2" (except when sorting). Saxon allows you to select backwards compatibility mode. Note that this doesn't prevent you using features of the XPath language defined in later versions; it merely affects the outcome of some functions and operators that had different behavior in 1.0.
The experimental features of early 4.0 drafts are not available unless you explicitly enable them.
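The XPath 1.0 comparison behavior that backwards compatibility mode retains can be observed with any XPath 1.0 engine. The following is a small sketch, assuming that XPathFactory.newInstance() returns the JDK's built-in XPath 1.0 implementation; the class name Xpath10Comparison and its test method are hypothetical. It does not show how to switch Saxon into backwards compatibility mode, only the 1.0 semantics that the mode preserves.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class Xpath10Comparison {

    // Evaluates a context-free boolean expression with the default JAXP engine.
    static boolean test(String expression) {
        try {
            // An empty document serves as a dummy context item.
            Document empty = DocumentBuilderFactory.newInstance()
                                                   .newDocumentBuilder()
                                                   .newDocument();
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, empty, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // XPath 1.0 converts both operands of ">" to numbers: 10 > 2.
        System.out.println(test("'10' > '2'"));
    }
}
```

Under XPath 1.0 rules the expression '10' > '2' is true, because both strings are converted to numbers. Without backwards compatibility mode, XPath 2.0 and later compare the two values as strings, and the same expression is false.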
XPath variables
Very often you will want to execute the same XPath expression repeatedly, but with some parameter taking different values: for example //person[@id='A1234'] where the required ID changes each time.
Constructing an XPath expression by string concatenation, for example "//person[@id='" + id + "']", is bad practice for two reasons:
- Compiling an XPath expression often takes much longer than executing it; with this approach, you have to compile the expression every time it is used.
- The code is open to injection attacks. Suppose the required ID is entered by an end user in a web form, and suppose the ID they enter is the string ']|doc('secret.xml')//person[@id='A1234. Then the actual expression evaluated will be "//person[@id='']|doc('secret.xml')//person[@id='A1234']", which will return data from an external document that you never intended the user to see.
Instead, write the expression to contain a variable reference: "//person[@id=$requiredId]", compile it once, and then execute it repeatedly, binding different values to the variable $requiredId.
All the XPath APIs provide a mechanism for doing this.
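As one illustration of the compile-once, bind-many pattern, here is a minimal sketch using the standard JAXP XPathVariableResolver interface. The class name PersonLookup, its nameFor method, and the sample document are hypothetical, and the sketch assumes the JDK's built-in JAXP engine is the one returned by XPathFactory.newInstance(). The s9api and Saxon.Api interfaces have their own variable-binding mechanisms, described on their respective pages.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PersonLookup {

    // Hypothetical sample data.
    static final String XML =
        "<people>"
      + "<person id='A1234'><name>Ann</name></person>"
      + "<person id='B5678'><name>Bob</name></person>"
      + "</people>";

    // Current variable bindings, consulted by the resolver at evaluation time.
    // A shared mutable map keeps the sketch short but is not thread-safe.
    private final Map<QName, Object> variables = new HashMap<>();
    private final XPathExpression expression;
    private final Document doc;

    PersonLookup() {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The resolver must be set before the expression is compiled.
            xpath.setXPathVariableResolver(variables::get);
            // Compiled once; the user-supplied ID never becomes part of this text.
            expression = xpath.compile("//person[@id=$requiredId]/name");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Binds $requiredId and runs the precompiled expression.
    String nameFor(String id) {
        variables.put(new QName("requiredId"), id);
        try {
            return expression.evaluate(doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        PersonLookup lookup = new PersonLookup();
        System.out.println(lookup.nameFor("A1234"));
        System.out.println(lookup.nameFor("B5678"));
        // The injection attempt is just an ordinary string value: no match.
        System.out.println("[" + lookup.nameFor("']|doc('secret.xml')//person[@id='A1234") + "]");
    }
}
```

Because the ID is supplied as a value rather than spliced into the expression text, the malicious input from the injection example is compared as a plain string against the id attributes and selects nothing.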