System Programming Interfaces

Strings

Extensive changes have been made to the internal representation of strings. There are a number of motivations for this, including better Unicode support, capitalizing on improvements to string handling in Java 9, and most notably, efficiency on the .NET platform: Saxon's previous extensive use of the Java CharSequence interface appeared to translate to very inefficient code on .NET.

Most uses of CharSequence have been replaced by a new class net.sf.saxon.str.UnicodeString (which also replaces the old class net.sf.saxon.regex.UnicodeString). The UnicodeString class has a number of implementations. All of them are designed to be codepoint-addressable: they expose an indexable array of 32-bit codepoint values, and never use surrogate pairs. The implementations of UnicodeString include:

Twine8: a string consisting entirely of codepoints in the range 1-255, held in an array with one byte per character.
Twine16: a string consisting entirely of codepoints in the range 1-65535, held in an array with two bytes per character.
Twine24: a string of arbitrary codepoints, held in an array with three bytes per character.
Slice8: a sub-range of an array using one byte per character.
Slice16: a sub-range of an array using two bytes per character.
Slice24: a sub-range of an array using three bytes per character.
BMPString: a wrapper around a Java/C# string known to contain no surrogate pairs.
ZenoString: a composite string held as a list of segments, each of which is itself a UnicodeString. The name derives from the algorithm used to combine segments, which results in segments having progressively decreasing lengths towards the end of the string.
StringView: a wrapper around an arbitrary Java/C# string. (This stores the string both in its native Java/C# form, and using a "real" codepoint-addressable implementation of UnicodeString, which is constructed lazily when it is first required.)
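The distinction between UTF-16 code units and codepoints, and the choice between one, two, or three bytes per character, can be illustrated using only the JDK. The following is a minimal sketch and not Saxon's implementation: the class CodepointWidthSketch and its methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Saxon API.
final class CodepointWidthSketch {

    /** Expands a UTF-16 Java string into one int per Unicode codepoint (no surrogate pairs). */
    static int[] toCodepoints(String s) {
        return s.codePoints().toArray();
    }

    /** Smallest number of bytes per character (1, 2 or 3) able to hold every codepoint supplied. */
    static int bytesPerChar(int[] codepoints) {
        int max = Arrays.stream(codepoints).max().orElse(0);
        if (max <= 0xFF) {
            return 1;   // fits in one byte per character
        }
        if (max <= 0xFFFF) {
            return 2;   // fits in two bytes per character
        }
        return 3;       // Unicode stops at 0x10FFFF, which fits in three bytes
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which Java stores as a surrogate pair
        String s = "a\uD83D\uDE00";
        int[] cps = toCodepoints(s);
        System.out.println(s.length());         // 3 UTF-16 code units
        System.out.println(cps.length);         // 2 codepoints
        System.out.println(bytesPerChar(cps));  // 3
    }
}
```

Indexing into the int array addresses whole characters directly, which is the property that a native Java/C# string lacks once surrogate pairs are present.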

The interface to the UnicodeString class is future-proofed to accommodate strings containing more than 2^31 characters. That's not to say that Saxon can now support such long strings everywhere (for example, the regular expression engine cannot handle such strings); but the groundwork has been laid.

The method Item.getStringValueCS(), which returned the string value of an item as a CharSequence, is dropped, and is replaced by a new method Item.getUnicodeStringValue() which returns the value as a UnicodeString.

The effect of toString() on atomic values has changed: it now returns the result of casting the value to a string (which is the same as the result of getStringValue()). To obtain the previous effect of the toString() method, use the new show() method.

Unicode normalization of strings (for example in the fn:normalize-unicode() function) now uses the JDK class java.text.Normalizer rather than code derived from the Unicode Consortium's implementation. This appears to be substantially faster.
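For reference, the JDK class can be called directly as shown in the sketch below. This illustrates java.text.Normalizer itself rather than the way Saxon invokes it; the wrapper class NormalizeSketch and its method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical wrapper for illustration; only java.text.Normalizer is a real JDK class here.
final class NormalizeSketch {

    /** Normalizes to NFC (canonical composition). */
    static String nfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    /** Normalizes to NFD (canonical decomposition). */
    static String nfd(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";                  // 'e' followed by a combining acute accent
        String composed = nfc(decomposed);              // the single character U+00E9
        System.out.println(decomposed.length());        // 2
        System.out.println(composed.length());          // 1
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC)); // true
    }
}
```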

Sequence and SequenceIterator

The SequenceIterator.next() method no longer throws checked (XPathException) exceptions. Instead, if a dynamic error occurs, an UncheckedXPathException is thrown. This change makes the SequenceIterator class play better with modern Java facilities such as streams and functional interfaces.
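The general pattern of wrapping a checked exception in an unchecked one, so that iteration can be driven from a lambda, can be sketched with the JDK alone. The classes CheckedError and UncheckedError below are hypothetical stand-ins, not the Saxon classes, and the unwrapping step at the boundary is an assumption about how a caller might choose to restore the checked exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of the checked/unchecked wrapping pattern; not Saxon code.
final class UncheckedIterationSketch {

    /** Stand-in for a checked dynamic-error exception. */
    static class CheckedError extends Exception {
        CheckedError(String message) {
            super(message);
        }
    }

    /** Unchecked wrapper that carries the checked exception as its cause. */
    static class UncheckedError extends RuntimeException {
        UncheckedError(CheckedError cause) {
            super(cause);
        }

        CheckedError getCheckedCause() {
            return (CheckedError) getCause();
        }
    }

    /**
     * Pulls items until the supplier returns null. Because Supplier.get() declares
     * no checked exceptions, errors must arrive as UncheckedError; they are
     * unwrapped here and rethrown in checked form at the API boundary.
     */
    static List<String> drain(Supplier<String> next) throws CheckedError {
        List<String> out = new ArrayList<>();
        try {
            String item;
            while ((item = next.get()) != null) {
                out.add(item);
            }
        } catch (UncheckedError e) {
            throw e.getCheckedCause();
        }
        return out;
    }
}
```

A caller can then pass a lambda such as () -> it.hasNext() ? it.next() : null without declaring any throws clause on the lambda itself.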

The SequenceIterator.getProperties() method is dropped. Instead, to determine whether a SequenceIterator supports look-ahead, first test whether it is an instance of LookaheadIterator, then cast it to LookaheadIterator and call the supportsHasNext() method; and similarly for GroundedIterator and LastPositionFinder. For example:

if (iter instanceof LookaheadIterator && ((LookaheadIterator) iter).supportsHasNext() && ((LookaheadIterator) iter).hasNext()) ...

NodeInfo

The method NodeInfo.iterateAxis(int axisNumber, Predicate<? super NodeInfo> nodeTest) is replaced by NodeInfo.iterateAxis(int axisNumber, NodePredicate predicate).

The reason for this is to facilitate conversion of the source code from Java to C#. In Java, a functional interface can be satisfied both by a lambda expression and by a concrete implementation class; in C#, classes and delegates are not interchangeable in the same way. The introduction of NodePredicate solves this by providing a concrete implementation that allows a lambda expression to be supplied as an argument. Similar changes have been made in some other, less visible, areas.

Miscellaneous

A number of internal changes have been made to facilitate conversion of the source code from Java to C#. These should only affect applications that use very low-level interfaces within Saxon.

For example, in some data objects such as ParseOptions, some properties were maintained as three-valued fields of type java.lang.Boolean (true, false, or null, where null means unspecified). C# booleans do not have a null in their value space, so the representation has changed, typically to Optional<Boolean>. The same applies to enumeration types where there was a need for a "null" in the value space; in some cases an extra enumeration constant such as UNKNOWN has been added.
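The three-valued representation using Optional<Boolean> can be sketched as follows. The Options class and its lineNumbering property are hypothetical names chosen for illustration; they are not taken from ParseOptions.

```java
import java.util.Optional;

// Hypothetical illustration of a three-valued setting; not the Saxon ParseOptions class.
final class TriStateSketch {

    /** Options holder with one setting that may be true, false, or unspecified. */
    static final class Options {
        // Optional.empty() plays the role previously played by a null java.lang.Boolean
        private Optional<Boolean> lineNumbering = Optional.empty();

        void setLineNumbering(boolean value) {
            lineNumbering = Optional.of(value);
        }

        Optional<Boolean> getLineNumbering() {
            return lineNumbering;
        }
    }

    /** Resolves a possibly unspecified setting against a fallback default. */
    static boolean resolve(Optional<Boolean> setting, boolean defaultValue) {
        return setting.orElse(defaultValue);
    }

    public static void main(String[] args) {
        Options options = new Options();
        System.out.println(options.getLineNumbering().isPresent());     // false: unspecified
        System.out.println(resolve(options.getLineNumbering(), true));  // true: falls back to the default
        options.setLineNumbering(false);
        System.out.println(resolve(options.getLineNumbering(), true));  // false: explicit value wins
    }
}
```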

Code designed to implement or use JAXP interfaces was previously scattered around the product rather liberally. Because JAXP interfaces exist in Java but not in C#, this code has often been moved into separate modules that are platform-specific.

Tracing and Diagnostics

A new class SystemLogger simplifies the task of sending all Saxon progress messages (as well as <xsl:message> output) to a supplied java.util.logging.Logger. This can be achieved using a call such as:

configuration.setLogger(new SystemLogger(Logger.getAnonymousLogger()));