Unicode Collation Algorithm
This section provides more detail on Saxon's support for the Unicode Collation Algorithm.
The Unicode Collation Algorithm is implemented using different libraries on different platforms, with differing levels of conformance:
- On SaxonJ-HE, Saxon uses the collation facilities available directly from the JDK. Because these collations are not always 100% compatible with the UCA as defined by Unicode, Saxon rejects the collation as unsupported if it specifies fallback=no.
- On SaxonJ-PE and -EE, Saxon uses the collation facilities available from the ICU4J library. Three conditions must be satisfied for this to work:
  - You must be using the SaxonJ PE or EE JAR files.
  - You must have a license file enabled.
  - The ICU4J JAR file (for example icu4j-72.1.jar) must be on the classpath.

  If any of these conditions is not satisfied, Saxon falls back silently to using collations from the JDK. If it is important to use the ICU collations, use fallback=no in the collation URI to prevent this happening.
- On SaxonCS, Saxon uses the collation facilities available from the ICU4N library, which is a port to C# of a subset of ICU4J.
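Because the fallback to JDK collations is silent, it can be useful to check at startup whether ICU4J is visible to the class loader. The following is a minimal sketch and not part of Saxon's API: the class name Icu4jCheck and its methods are hypothetical, and the only assumption about ICU4J is that its collator class is com.ibm.icu.text.Collator. It checks the classpath condition only; it says nothing about the JAR edition or the license file.

```java
final class Icu4jCheck {
    /** Returns true if the named class can be located, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, Icu4jCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Assumption: the presence of ICU4J's Collator class implies the ICU4J JAR is on the classpath. */
    static boolean isIcu4jAvailable() {
        return isClassAvailable("com.ibm.icu.text.Collator");
    }

    public static void main(String[] args) {
        System.out.println("ICU4J on classpath: " + isIcu4jAvailable());
    }
}
```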
This feature is requested by using a collation (specified either as a collation attribute on an xsl:sort instruction, an xsl:default-collation attribute within an XSLT tree, or the $collation argument of a comparison function, such as fn:compare() or fn:deep-equal()) that uses the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part.
The query is a semicolon-separated sequence of zero or more keyword=value pairs, e.g. ...UCA?reorder=digit,space;strength=secondary. Full details of the query format and parameters can be found in The Unicode Collation Algorithm section of the specification. This section discusses the Saxon implementation.
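As an illustration of the query format, the following sketch assembles and splits a collation URI of this shape. It is a simplified example rather than Saxon's own parser: the class UcaCollationUri and its methods are hypothetical, and no URI escaping of keywords or values is attempted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class UcaCollationUri {
    static final String BASE = "http://www.w3.org/2013/collation/UCA";

    /** Builds a collation URI from keyword=value pairs, in insertion order. */
    static String build(Map<String, String> params) {
        if (params.isEmpty()) {
            return BASE;
        }
        StringJoiner query = new StringJoiner(";", BASE + "?", "");
        params.forEach((keyword, value) -> query.add(keyword + "=" + value));
        return query.toString();
    }

    /** Splits the query part of a collation URI into keyword=value pairs. */
    static Map<String, String> parse(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0 || q == uri.length() - 1) {
            return params; // no query part: zero pairs
        }
        for (String pair : uri.substring(q + 1).split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate a stray semicolon (an assumption of this sketch)
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Not a keyword=value pair: " + pair);
            }
            params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("reorder", "digit,space");
        params.put("strength", "secondary");
        String uri = build(params);
        System.out.println(uri);
        System.out.println(parse(uri));
    }
}
```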
Saxon supports the W3C-defined parameters as follows:
| keyword | values | default | Notes |
| --- | --- | --- | --- |
| fallback | yes \| no | yes | |
| lang | any value allowed for | From the locale | The implementation uses an appropriate collation for the requested locale from the ICU environment, splitting the … For a list of locales supported in this implementation see UCA-supported locales |
| version | string | 6.2.0.0 | The version of the UCA to be used. Interpreted as an ascending sequence of major.minor.update... version numbers. Requests for versions less than or equal to the current supported version are processed with the current version. Requests for a higher version raise an error. |
| strength | primary \| secondary \| tertiary \| quaternary \| identical, or 1 \| 2 \| 3 \| 4 \| 5 as synonyms | tertiary, but see notes | Default strength may be altered by a specific locale. |
| maxVariable | space \| punct \| symbol \| currency | punct | Determines which characters are considered as "noise" for the purposes of the alternate property. |
| alternate | non-ignorable \| shifted \| blanked | non-ignorable | This (poorly named) property controls the handling of "noise" characters such as spaces and punctuation. More specifically, it controls the handling of characters up to the value of maxVariable. |
| backwards | yes \| no | no, but see notes | This is principally used for backwards-order comparison of (French) accents at secondary strength, and the default may be set by the locale used. For example |
| normalization | yes \| no | no | |
| caseLevel | yes \| no | no | As specified by W3C. |
| caseFirst | upper \| lower | See notes | The default is to ignore case preferences. |
| numeric | yes \| no | no | As specified by W3C. |
| reorder | a comma-separated sequence of reorder codes, where a reorder code is one of | | As specified by W3C. Saxon testing revealed a bug in the ICU library which has been reported, but is not fixed at the time of writing. |
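The rule described for the version parameter can be sketched as a component-wise numeric comparison. This is an illustrative sketch under stated assumptions, not Saxon's implementation: the class UcaVersionCheck is hypothetical, missing trailing components are assumed to count as zero, and the supported version is passed in by the caller rather than read from Saxon.

```java
final class UcaVersionCheck {
    /** Compares dotted version strings numerically; missing components count as 0 (assumption). */
    static int compareVersions(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /**
     * Returns the version that would actually be used: the supported version
     * when the request is not higher, otherwise an error.
     */
    static String resolve(String requested, String supported) {
        if (compareVersions(requested, supported) > 0) {
            throw new IllegalArgumentException(
                "Requested UCA version " + requested + " is higher than supported " + supported);
        }
        return supported;
    }

    public static void main(String[] args) {
        // "6.2.0.0" is the default shown in the table, used here only as an example value.
        System.out.println(resolve("5.1", "6.2.0.0"));
    }
}
```

Comparing components as integers rather than as strings matters for values such as 6.10 and 6.2, where a plain string comparison would give the wrong order.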
Additional Saxon parameters
In addition to the standard parameters, Saxon supports some further parameters of its own:
| keyword | values | effect |
| --- | --- | --- |
| class | fully-qualified Java class name of a class that implements | This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as |
| rules | details of the ordering required, using the syntax of the Java | This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of |
| ignore-case | yes \| no | Indicates whether the case of letters should be ignored: equivalent to |
| ignore-modifiers | yes \| no | Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. Equivalent to |
| ignore-symbols | yes \| no | Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default. |
| ignore-width | yes \| no | Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary. |
| decomposition | none \| standard \| full | Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform. |
| alphanumeric | yes \| no \| codepoint | If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as |
| case-order | upper-first \| lower-first | Indicates whether upper case letters collate before or after lower case letters. |
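The alphanumeric option describes a splitting and pairwise comparison procedure that can be sketched in a few lines. The code below is an illustration of that description, not Saxon's implementation: the class AlphanumericSketch is hypothetical, the comparator for alphabetic parts is supplied by the caller, and the rule that a string with fewer parts sorts first when all shared parts are equal is an assumption of this sketch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AlphanumericSketch {
    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Splits into alternating parts: even positions alphabetic (possibly empty), odd positions numeric. */
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) {
                i++;
            }
            parts.add(s.substring(start, i)); // alphabetic part, zero-length if a digit comes first
            if (i == s.length()) {
                break;
            }
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) {
                i++;
            }
            parts.add(s.substring(start, i)); // numeric part
        }
        return parts;
    }

    /** Compares alphabetic parts with the given comparator and numeric parts by numeric value. */
    static Comparator<String> comparator(Comparator<String> alpha) {
        return (a, b) -> {
            List<String> pa = split(a);
            List<String> pb = split(b);
            int n = Math.min(pa.size(), pb.size());
            for (int i = 0; i < n; i++) {
                int c = (i % 2 == 0)
                        ? alpha.compare(pa.get(i), pb.get(i))
                        : new BigInteger(pa.get(i)).compareTo(new BigInteger(pb.get(i)));
                if (c != 0) {
                    return c;
                }
            }
            // Assumption: when all shared parts are equal, fewer parts sorts first.
            return Integer.compare(pa.size(), pb.size());
        };
    }

    public static void main(String[] args) {
        // String's natural order compares UTF-16 code units, used here as a stand-in collation.
        Comparator<String> natural = comparator(Comparator.naturalOrder());
        System.out.println(natural.compare("AD985", "AD1066") < 0);
    }
}
```

Using BigInteger for the numeric parts avoids overflow on long digit runs and makes leading zeros insignificant.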
UCA fallback behaviour
The specification supports fallback behaviour in the case that UCA is not, or only partially, implemented. If the query contains the parameter fallback=no and the implementation of UCA is unsupported (as is the case with Saxon-HE, or when ICU features are not loaded), then the request raises an 'unknown collation' error. If fallback is yes or absent, the implementation makes a best-effort attempt to satisfy the request.
For Saxon-HE, or in the absence of ICU, this involves building a tailored collation based on the Java library/Unicode implementation, with the following re-mapping of parameters:
| keyword(s) | values | effect |
| --- | --- | --- |
| lang | language/locale code | Use an appropriate collation for the requested locale from the Java environment, if available, else codepoint collation. |
| strength | primary \| secondary \| tertiary \| identical | Used as stated. |
| strength | 1 \| 2 \| 3 | remapped to |
| strength | quaternary \| 4 \| 5 | remapped to |
| caseFirst | upper \| lower | remapped to |
| numeric | yes \| no | remapped to |
| version, alternate, backwards, normalization, caseLevel, reorder | - | All ignored. |
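To give a feel for what a JDK-based fallback looks like, the sketch below builds a java.text.Collator from the lang and strength parameters only. It is a hedged illustration, not Saxon's code: the class JdkFallbackSketch is hypothetical, the mapping of the named strengths onto the Collator strength constants is an assumption, the JVM default locale is assumed when lang is absent, and the remapped values and the codepoint fallback are deliberately not reproduced.

```java
import java.text.Collator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class JdkFallbackSketch {
    /** Builds a JDK collator from a subset of the UCA parameters (lang and strength only). */
    static Collator fromParams(Map<String, String> params) {
        if ("no".equals(params.get("fallback"))) {
            // Only JDK collations are available in this sketch, so fallback=no cannot be honoured.
            throw new IllegalArgumentException("Unknown collation: fallback=no requested");
        }
        String lang = params.get("lang");
        // Assumption: use the JVM default locale when no lang is given.
        Locale locale = (lang == null) ? Locale.getDefault() : Locale.forLanguageTag(lang);
        Collator collator = Collator.getInstance(locale);

        String strength = params.get("strength");
        if (strength != null) {
            switch (strength) {
                case "primary":   collator.setStrength(Collator.PRIMARY);   break;
                case "secondary": collator.setStrength(Collator.SECONDARY); break;
                case "tertiary":  collator.setStrength(Collator.TERTIARY);  break;
                case "identical": collator.setStrength(Collator.IDENTICAL); break;
                default: break; // remapped values are not handled in this sketch
            }
        }
        return collator;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("lang", "en");
        params.put("strength", "primary");
        // At primary strength the JDK English collator treats case differences as insignificant.
        System.out.println(fromParams(params).compare("a", "A") == 0);
    }
}
```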