Collation
Collations used for comparing strings can be specified by means of a URI. A collation URI may be used as an argument to many of the standard functions, as an attribute of various XSLT instructions (xsl:sort, xsl:for-each-group, xsl:merge-key), and in the order by clause of a FLWOR expression in XQuery.
Saxon provides a range of mechanisms for binding collation URIs. The language specifications simply say that collations used in sorting and in string-comparison functions are identified by a URI, and leave it up to the implementation how these URIs are defined.
There are some predefined collations that cannot be changed. Specifically:
-
The Unicode Codepoint Collation defined in the W3C specifications (see http://www.w3.org/2005/xpath-functions/collation/codepoint). This collates strings based on the integer values assigned by Unicode to each character; for example, "ah!" sorts before "ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63).
-
The "HTML ASCII case-blind collation", http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive. This is designed to mimic the HTML5 rules for matching many names and keywords, whereby case distinctions are ignored for the English letters (A-Z, a-z) but not for accented or non-English letters.
Saxon implements this effectively by converting upper-case letters A-Z to their lower-case equivalents, and then using the codepoint collation.
-
The family of collations implementing the Unicode Collation Algorithm, http://www.w3.org/2013/collation/UCA?keyword=value;...
Saxon-PE and Saxon-EE implement UCA collations by use of the ICU-J open source library, which supports a large range of languages and all the parameters defined in the UCA specification. Saxon-HE has a simpler implementation which relies on the collation support available in the built-in Java library; the languages this supports depend on the particular Java installation, and not all parameters are supported. For this reason, Saxon-HE supports UCA collations only if a fallback implementation is allowed (that is, if the option fallback=no is not present in the collation URI).
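The behaviour described for the first two predefined collations can be sketched in plain Java. This is an illustration only, not Saxon's own code: the class and method names are hypothetical, and the case-blind comparison simply follows the description above (fold A-Z to a-z, then compare by codepoint).

```java
import java.util.Arrays;

// Hypothetical sketch of the two simplest predefined collations.
class PredefinedCollationSketch {

    // Compare two strings codepoint by codepoint (not UTF-16 code unit by code unit).
    static int compareCodepoints(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        return Arrays.compare(ca, cb); // lexicographic comparison, Java 9+
    }

    // Fold only ASCII A-Z to a-z; every other character is left untouched.
    static String foldAsciiCase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return sb.toString();
    }

    // Case-blind comparison as described above: fold, then compare by codepoint.
    static int compareHtmlAsciiCaseBlind(String a, String b) {
        return compareCodepoints(foldAsciiCase(a), foldAsciiCase(b));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("ah!".codePoints().toArray())); // [97, 104, 33]
        System.out.println(Arrays.toString("ah?".codePoints().toArray())); // [97, 104, 63]
        System.out.println(compareCodepoints("ah!", "ah?") < 0);           // true
        System.out.println(compareHtmlAsciiCaseBlind("DIV", "div") == 0);  // true
        // U+00C9 and U+00E9 (E-acute, upper and lower case) are not folded.
        System.out.println(compareHtmlAsciiCaseBlind("\u00C9", "\u00E9") == 0); // false
    }
}
```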
For backwards compatibility reasons the standard collation resolver in Saxon also accepts URIs in the form http://saxon.sf.net/collation followed by query parameters; the query parameters that are recognized are the same as those defined for W3C UCA collation URIs.
The keywords defined by W3C are: fallback, lang, version, strength, maxVariable, alternate, backwards, normalization, caseLevel, caseFirst, numeric, and reorder. The values for these parameters and their meanings can be found at https://www.w3.org/TR/xslt-30/#uca-collations.
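As a rough illustration of how such a URI is put together, the following sketch joins keyword=value pairs with semicolons after the W3C base URI. The class and method names are hypothetical, and the parameter values shown (lang=de, strength=primary) are merely examples of values permitted by the W3C specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for assembling a UCA collation URI; not part of Saxon.
class UcaUriSketch {

    static final String UCA_BASE = "http://www.w3.org/2013/collation/UCA";

    // Append the parameters as keyword=value pairs separated by semicolons.
    static String ucaUri(Map<String, String> params) {
        if (params.isEmpty()) {
            return UCA_BASE;
        }
        return UCA_BASE + "?" + params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("lang", "de");
        params.put("strength", "primary");
        System.out.println(ucaUri(params));
        // http://www.w3.org/2013/collation/UCA?lang=de;strength=primary
    }
}
```

The resulting string can then be supplied wherever a collation URI is accepted, for example as the collation argument of a string-comparison function.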
Whether the W3C URI http://www.w3.org/2013/collation/UCA or the Saxon URI http://saxon.sf.net/collation is used, Saxon accepts a number of collation parameters additional to those defined by W3C, as follows:
| keyword | values | effect |
| --- | --- | --- |
| class | fully-qualified Java class name of a class that implements | This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as |
| rules | details of the ordering required, using the syntax of the Java | This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of |
| ignore-case | yes \| no | Indicates whether the case of letters should be ignored: equivalent to |
| ignore-modifiers | yes \| no | Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. Equivalent to |
| ignore-symbols | yes \| no | Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default. |
| ignore-width | yes \| no | Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary. |
| decomposition | none \| standard \| full | Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform. |
| alphanumeric | yes \| no \| codepoint | If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as |
| case-order | upper-first \| lower-first | Indicates whether upper case letters collate before or after lower case letters. |
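The splitting and pairwise comparison described for the alphanumeric parameter can be sketched as follows. This is a minimal illustration under stated assumptions, not Saxon's implementation: the class and method names are hypothetical, and String.compareTo stands in for whatever collation applies to the alphabetic parts.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "natural sorting" as described for alphanumeric=yes.
class AlphanumericSketch {

    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Split into alternating parts: alpha, numeric, alpha, numeric, ...
    // Even positions hold alphabetic parts (the first may be empty),
    // odd positions hold runs of ASCII digits.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i));   // alphabetic part, possibly ""
            if (i == s.length()) break;
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i));   // numeric part
        }
        return parts;
    }

    // Compare part by part: alphabetic parts as strings, numeric parts by value.
    static int compare(String a, String b) {
        List<String> pa = split(a);
        List<String> pb = split(b);
        int n = Math.min(pa.size(), pb.size());
        for (int k = 0; k < n; k++) {
            int c = (k % 2 == 0)
                    ? pa.get(k).compareTo(pb.get(k))   // stand-in for the real collation
                    : new BigInteger(pa.get(k)).compareTo(new BigInteger(pb.get(k)));
            if (c != 0) return c;
        }
        return Integer.compare(pa.size(), pb.size());  // shorter sequence sorts first
    }

    public static void main(String[] args) {
        List<String> dates = new ArrayList<>(Arrays.asList("AD1066", "AD985", "AD43"));
        dates.sort(AlphanumericSketch::compare);
        System.out.println(dates);                          // [AD43, AD985, AD1066]
        System.out.println("AD985".compareTo("AD1066") > 0); // true: plain string order differs
    }
}
```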
If you want to use your own URIs to define collations, there are two ways of doing this:
-
You can use the Saxon configuration file to define collations: see The collations element.
-
You can register a collation with the Saxon Configuration using the s9api method Processor.declareCollation() or the underlying method Configuration.registerCollation().
In either case, the collation is supplied in the form of an implementation of the interface net.sf.saxon.lib.StringCollator. You must supply methods compareStrings() and comparesEqual() for comparing strings for ordering or equality, and a method getCollationKey() which returns a collation key for any string. If you want your collation to be used in calls of fn:contains(), fn:starts-with(), fn:ends-with(), fn:substring-before(), or fn:substring-after(), then it must also implement the interface net.sf.saxon.lib.SubstringMatcher.
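The relationship between the three required methods can be illustrated with a self-contained sketch. Note that the class below is a hypothetical stand-in that does not implement net.sf.saxon.lib.StringCollator: the real interface uses Saxon-specific parameter and return types that differ between releases, so consult the Javadoc for your Saxon version before writing an actual implementation. The sketch uses plain String throughout and a simple case-insensitive ordering.

```java
import java.util.Locale;

// Hypothetical stand-in showing the contract between ordering, equality,
// and collation keys. It does NOT implement the Saxon interface.
class CaseBlindCollatorSketch {

    // The normalized form that drives all three operations.
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Ordering comparison: negative, zero, or positive.
    static int compareStrings(String a, String b) {
        return normalize(a).compareTo(normalize(b));
    }

    // Equality comparison, kept consistent with compareStrings().
    static boolean comparesEqual(String a, String b) {
        return compareStrings(a, b) == 0;
    }

    // Collation key: two strings are equal under the collation
    // exactly when their keys are equal.
    static String getCollationKey(String s) {
        return normalize(s);
    }

    public static void main(String[] args) {
        System.out.println(comparesEqual("Saxon", "SAXON"));                           // true
        System.out.println(getCollationKey("Saxon").equals(getCollationKey("SAXON"))); // true
        System.out.println(compareStrings("apple", "Banana") < 0);                     // true
    }
}
```

Keeping all three methods on top of one normalization step is a simple way to guarantee that they agree with one another, which matters when the same collation is used both for sorting and for grouping or hashing.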