System Programming Interfaces
Strings
Extensive changes have been made to the internal representation of strings. There are a number of motivations for this, including better Unicode support, capitalizing on improvements to string handling in Java 9, and most notably, efficiency on the .NET platform: Saxon's previous extensive use of the Java CharSequence interface appeared to translate to very inefficient code on .NET.
Most uses of CharSequence have been replaced by a new class net.sf.saxon.str.UnicodeString (which also replaces the old class net.sf.saxon.regex.UnicodeString). The UnicodeString class has a number of implementations. All of them are designed to be codepoint-addressable: they expose an indexable array of 32-bit codepoint values, and never use surrogate pairs. The implementations of UnicodeString include:
- Twine8: a string consisting entirely of codepoints in the range 1-255, held in an array with one byte per character.
- Twine16: a string consisting entirely of codepoints in the range 1-65535, held in an array with two bytes per character.
- Twine24: a string of arbitrary codepoints, held in an array with three bytes per character.
- Slice8: a sub-range of an array using one byte per character.
- Slice16: a sub-range of an array using two bytes per character.
- Slice24: a sub-range of an array using three bytes per character.
- BMPString: a wrapper around a Java/C# string known to contain no surrogate pairs.
- ZenoString: a composite string held as a list of segments, each of which is itself a UnicodeString. The name derives from the algorithm used to combine segments, which results in segments having progressively decreasing lengths towards the end of the string.
- StringView: a wrapper around an arbitrary Java/C# string. (This stores the string both in its native Java/C# form, and using a "real" codepoint-addressable implementation of UnicodeString, which is constructed lazily when it is first required.)
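To illustrate what codepoint addressing means in practice, here is a minimal sketch using only the JDK. The class and method names (CodepointWidthSketch, toCodepoints, bytesPerCodepoint) are hypothetical and are not part of Saxon; the width selection mirrors the 8/16/24-bit split described above, but it is not Saxon's actual selection logic.

```java
import java.util.Arrays;

final class CodepointWidthSketch {

    /** Expands a UTF-16 Java string into one 32-bit value per Unicode codepoint. */
    static int[] toCodepoints(String s) {
        return s.codePoints().toArray();
    }

    /** Smallest number of bytes per character (1, 2 or 3) able to hold every codepoint. */
    static int bytesPerCodepoint(int[] codepoints) {
        int max = Arrays.stream(codepoints).max().orElse(0);
        if (max <= 0xFF) {
            return 1;
        }
        if (max <= 0xFFFF) {
            return 2;
        }
        return 3;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (a surrogate pair in UTF-16), 'b'
        String s = "a\uD83D\uDE00b";
        int[] cps = toCodepoints(s);
        System.out.println(s.length());                   // 4 UTF-16 code units
        System.out.println(cps.length);                   // 3 codepoints
        System.out.println(Integer.toHexString(cps[1]));  // 1f600
        System.out.println(bytesPerCodepoint(cps));       // 3
    }
}
```

With a native Java string, index 1 would address half of a surrogate pair; in the codepoint array, index 1 addresses the whole character.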
The interface to the UnicodeString class is future-proofed to accommodate strings containing more than 2^31 characters. That's not to say that Saxon can now support such long strings everywhere (for example, the regular expression engine cannot handle such strings); but the groundwork has been laid.
The method Item.getStringValueCS(), which returned the string value of an item as a CharSequence, is dropped, and is replaced by a new method Item.getUnicodeStringValue() which returns the value as a UnicodeString.
The effect of toString() on atomic values has changed: it now returns the result of casting the value to a string (which is the same as the result of getStringValue()). To obtain the previous effect of the toString() method, use the new show() method.
Unicode normalization of strings (for example in the fn:normalize-unicode() function) now uses the JDK class java.text.Normalizer rather than code derived from the Unicode Consortium's implementation. This appears to be substantially faster.
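For readers unfamiliar with the JDK class, the following sketch shows java.text.Normalizer used directly. The wrapper class NormalizerSketch and its methods are hypothetical; how Saxon calls the normalizer internally is not shown here.

```java
import java.text.Normalizer;

final class NormalizerSketch {

    /** Normalizes to NFC (canonical composition). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Normalizes to NFD (canonical decomposition). */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String decomposed = "e\u0301";
        String composed = nfc(decomposed);
        System.out.println(composed.equals("\u00E9"));  // true: a single precomposed character
        System.out.println(nfd(composed).length());     // 2: base letter plus combining mark
    }
}
```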
Sequence and SequenceIterator
The SequenceIterator.next() method no longer throws checked exceptions (XPathException). Instead, if a dynamic error occurs, an UncheckedXPathException is thrown. This change makes the SequenceIterator class play better with modern Java facilities such as streams and functional interfaces.
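The benefit can be sketched without Saxon on the classpath. Every type below (ItemIterator, CheckedError, UncheckedError) is a hypothetical stand-in, not a Saxon class: the point is only that a next() method with no checked exception can be passed as a method reference to a standard functional interface, with the failure converted back to a checked exception at the boundary if the caller wants one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

final class UncheckedIterationSketch {

    /** Stand-in for a checked dynamic-error exception (hypothetical). */
    static final class CheckedError extends Exception {
        CheckedError(String message) {
            super(message);
        }
    }

    /** Stand-in for an unchecked wrapper around the checked exception (hypothetical). */
    static final class UncheckedError extends RuntimeException {
        UncheckedError(CheckedError cause) {
            super(cause);
        }

        CheckedError unwrap() {
            return (CheckedError) getCause();
        }
    }

    /** Stand-in iterator: next() declares no checked exception and returns null at the end. */
    interface ItemIterator {
        String next();
    }

    /** Accepts any Supplier, so an iterator's next() can be passed as a method reference. */
    static List<String> drain(Supplier<String> source) {
        List<String> items = new ArrayList<>();
        for (String item = source.get(); item != null; item = source.get()) {
            items.add(item);
        }
        return items;
    }

    /** Converts the unchecked failure back to a checked one at the API boundary. */
    static List<String> drainChecked(ItemIterator iter) throws CheckedError {
        try {
            return drain(iter::next);
        } catch (UncheckedError e) {
            throw e.unwrap();
        }
    }

    public static void main(String[] args) throws CheckedError {
        Iterator<String> data = Arrays.asList("a", "b").iterator();
        ItemIterator ok = () -> data.hasNext() ? data.next() : null;
        System.out.println(drainChecked(ok));  // [a, b]
    }
}
```

If next() declared a checked exception, the method reference iter::next would not compile as a Supplier, which is the friction the change removes.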
The SequenceIterator.getProperties() method is dropped. Instead, to determine whether a SequenceIterator supports look-ahead, first test whether it is an instance of LookaheadIterator, then cast it to LookaheadIterator and call the supportsHasNext() method; and similarly for GroundedIterator and LastPositionFinder. For example:
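A minimal sketch of this test-then-cast pattern follows. The nested interfaces are simplified stand-ins declared locally so that the example compiles on its own; they borrow the names SequenceIterator, LookaheadIterator and supportsHasNext() from the text above, but their shape (String items, the hasNext() method, the inheritance relationship) is an assumption, and ArrayIterator and canLookAhead are hypothetical.

```java
final class LookaheadSketch {

    /** Simplified stand-in for SequenceIterator: next() returns null at the end. */
    interface SequenceIterator {
        String next();
    }

    /** Simplified stand-in for LookaheadIterator (method set assumed for illustration). */
    interface LookaheadIterator extends SequenceIterator {
        boolean supportsHasNext();

        boolean hasNext();
    }

    /** Hypothetical array-backed iterator that can always look ahead. */
    static final class ArrayIterator implements LookaheadIterator {
        private final String[] items;
        private int position = 0;

        ArrayIterator(String... items) {
            this.items = items;
        }

        @Override
        public String next() {
            return position < items.length ? items[position++] : null;
        }

        @Override
        public boolean supportsHasNext() {
            return true;
        }

        @Override
        public boolean hasNext() {
            return position < items.length;
        }
    }

    /** Tests the type first, then casts and asks the instance whether look-ahead works. */
    static boolean canLookAhead(SequenceIterator iter) {
        return iter instanceof LookaheadIterator
                && ((LookaheadIterator) iter).supportsHasNext();
    }

    public static void main(String[] args) {
        SequenceIterator lookahead = new ArrayIterator("a", "b");
        SequenceIterator plain = () -> null;  // an empty iterator with no look-ahead
        System.out.println(canLookAhead(lookahead));  // true
        System.out.println(canLookAhead(plain));      // false
    }
}
```

The two-step check matters because implementing the interface does not by itself guarantee the capability: the instance still has to confirm it through supportsHasNext().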
NodeInfo
The method NodeInfo.iterateAxis(int axisNumber, Predicate<? super NodeInfo> nodeTest) is replaced by NodeInfo.iterateAxis(int axisNumber, NodePredicate predicate).
The reason for this is to facilitate conversion of the source code from Java to C#. In Java, a functional interface can be satisfied both by a lambda expression and by a concrete implementation class; in C#, classes and delegates are not interchangeable in the same way. The introduction of NodePredicate solves this by providing a concrete implementation that allows a lambda expression to be supplied as an argument. Similar changes have been made in some other, less visible, areas.
Miscellaneous
A number of internal changes have been made to facilitate conversion of the source code from Java to C#. These should only affect applications that use very low-level interfaces within Saxon.
For example, in some data objects such as ParseOptions, some properties were maintained as three-valued fields of type java.lang.Boolean (true, false, or null, meaning unspecified). C# booleans do not have a null in the value space, so the representation has changed, typically to Optional<Boolean>. The same applies to enumeration types where there was a need for a "null" in the value space; in some cases an extra enumeration constant such as UNKNOWN has been added.
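The two null-free representations can be sketched as follows. The class TriStateSketch, the Strictness enumeration and the resolve methods are hypothetical illustrations, not Saxon code; only Optional<Boolean> and the idea of an UNKNOWN constant come from the text above.

```java
import java.util.Optional;

final class TriStateSketch {

    /** Hypothetical enumeration with an explicit constant standing in for "null". */
    enum Strictness { STRICT, LAX, UNKNOWN }

    /** Resolves a three-valued boolean: an empty Optional means "unspecified". */
    static boolean resolve(Optional<Boolean> setting, boolean defaultValue) {
        return setting.orElse(defaultValue);
    }

    /** Resolves an enumeration value, treating UNKNOWN as "unspecified". */
    static Strictness resolve(Strictness setting, Strictness defaultValue) {
        return setting == Strictness.UNKNOWN ? defaultValue : setting;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Optional.empty(), true));              // true (default used)
        System.out.println(resolve(Optional.of(false), true));            // false (explicit value)
        System.out.println(resolve(Strictness.UNKNOWN, Strictness.LAX));  // LAX
    }
}
```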
Code designed to implement or use JAXP interfaces was previously scattered around the product rather liberally. Because JAXP interfaces exist in Java but not in C#, this code has often been moved into separate modules that are platform-specific.
Tracing and Diagnostics
A new class SystemLogger simplifies the task of sending all Saxon progress messages (as well as <xsl:message> output) to a supplied java.util.logging.Logger. This can be achieved using a call such as: