Unicode Collation Algorithm
XSL Transformations (XSLT) Version 3.0 and XQuery/XPath 3.1 support the use of the Unicode Collation Algorithm (UCA) for comparing strings in a variety of locales and with extensive parametric control. This feature is requested by using a collation (specified either as a collation attribute on an xsl:sort instruction, an xsl:default-collation attribute within an XSLT tree, or the $collation argument of a 'comparison' function such as fn:compare() or fn:deep-equal()) that uses the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part.
The query is a semicolon-separated sequence of zero or more keyword=value pairs, e.g. ...UCA?reorder=digit,space;strength=secondary. Full details of the query format and parameters can be found in The Unicode Collation Algorithm section of the specification. This section discusses the Saxon implementation.
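As an illustration of the query format described above, the following Python sketch assembles and splits a UCA collation URI. The helper names build_uca_uri and parse_uca_uri are hypothetical and are not part of Saxon or of any W3C API; the sketch assumes only the scheme, path and separator rules stated in this section.

```python
# Scheme and path of the UCA collation URI, as given in the text above.
UCA_BASE = "http://www.w3.org/2013/collation/UCA"


def build_uca_uri(params):
    """Join keyword=value pairs with ';' and append them as the query part.

    Hypothetical helper for illustration only.
    """
    if not params:
        return UCA_BASE  # the query part is optional
    query = ";".join(f"{key}={value}" for key, value in params.items())
    return f"{UCA_BASE}?{query}"


def parse_uca_uri(uri):
    """Split a UCA collation URI back into a dict of keyword/value strings."""
    base, _, query = uri.partition("?")
    if base != UCA_BASE:
        raise ValueError(f"not a UCA collation URI: {uri}")
    params = {}
    for pair in query.split(";"):
        if not pair:
            continue  # tolerate an empty query or a stray ';'
        key, _, value = pair.partition("=")
        params[key] = value
    return params


uri = build_uca_uri({"reorder": "digit,space", "strength": "secondary"})
print(uri)
print(parse_uca_uri(uri))
```

Note that the comma inside reorder=digit,space belongs to the value, so only the semicolon is treated as a pair separator.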
Full support for UCA is provided only in Saxon-PE/EE, from version 9.6 onwards; Saxon-HE uses the fallback behaviour described below.
Saxon-PE/EE uses ICU (International Components for Unicode) to support UCA. More detailed information is available from the ICU site.
The ICU features require a sizeable (~7 MB) library, which may be supplied either in the main JAR file or as a separate JAR. The separate JAR can be either the 'minimised' version in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site. If the ICU features have not been loaded within Saxon-PE/EE, the fallback behaviour described below is used.
The parameters and their behaviour in the Saxon-PE/EE implementation are as follows:
keyword | values | default | Notes |
fallback | yes or no | yes | |
lang | any value allowed for | From the locale | The implementation uses an appropriate collation for the requested locale from the ICU environment, splitting the For a list of locales supported in this implementation see UCA-supported locales |
version | string | 6.2.0.0 | The version of the UCA to be used. Interpreted as an ascending sequence of major.minor.update... version numbers. Requests for versions less than or equal to the current supported version are processed with the current version. Requests for a higher version raise an error. |
strength | primary, secondary, tertiary, quaternary or identical; 1, 2, 3, 4 or 5 as synonyms | tertiary, but see notes | Default strength may be altered by a specific locale. |
maxVariable | space, punct, symbol or currency | punct | Determines which characters are considered as "noise" for the purposes of the |
alternate | non-ignorable, shifted or blanked | non-ignorable | This (poorly named) property controls the handling of "noise" characters such as spaces and punctuation. More specifically, it controls the handling of characters up to the value of |
backwards | yes or no | no, but see notes | This is principally used for backwards-order comparison of (French) accents at secondary strength, and the default may be set by the locale used. For example |
normalization | yes or no | no | |
caseLevel | yes or no | no | As specified by W3C. |
caseFirst | upper or lower | See notes | The default is to ignore case preferences. |
numeric | yes or no | no | As specified by W3C. |
reorder | a comma-separated sequence of reorder codes, where a reorder code is one of | | As specified by W3C. Saxon testing revealed a bug in the ICU library which has been reported, but is not fixed at the time of writing. |
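The version and strength rules in the table can be sketched as follows. This is a minimal illustration, not Saxon's code: it assumes that the supported UCA version is the table default 6.2.0.0, and that missing trailing version components count as zero (so 6.2 equals 6.2.0.0). The function names are hypothetical.

```python
# Assumption: the supported UCA version is the default listed in the table.
SUPPORTED_UCA_VERSION = "6.2.0.0"

# Numeric synonyms for the strength keyword, as listed in the table.
STRENGTH_SYNONYMS = {
    "1": "primary",
    "2": "secondary",
    "3": "tertiary",
    "4": "quaternary",
    "5": "identical",
}


def compare_versions(a, b):
    """Return -1, 0 or 1 comparing dotted version strings numerically.

    Assumption: missing trailing components are treated as 0.
    """
    pa = [int(part) for part in a.split(".")]
    pb = [int(part) for part in b.split(".")]
    width = max(len(pa), len(pb))
    pa += [0] * (width - len(pa))
    pb += [0] * (width - len(pb))
    return (pa > pb) - (pa < pb)


def resolve_version(requested):
    """Lower or equal requests use the supported version; higher ones fail."""
    if compare_versions(requested, SUPPORTED_UCA_VERSION) > 0:
        raise ValueError(f"UCA version {requested} is not supported")
    return SUPPORTED_UCA_VERSION


def normalize_strength(value):
    """Map 1..5 to the named strengths and reject anything else."""
    name = STRENGTH_SYNONYMS.get(value, value)
    if name not in STRENGTH_SYNONYMS.values():
        raise ValueError(f"invalid strength: {value}")
    return name


print(resolve_version("5.1"))
print(normalize_strength("2"))
```

Comparing the components as integers rather than as strings matters here: a string comparison would rank 6.10 below 6.2.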
UCA Fallback Behaviour
The specification allows fallback behaviour where UCA is not implemented, or is only partially implemented. If the query contains the parameter fallback=no and UCA is unsupported (as is the case with Saxon-HE, or when the ICU features are not loaded), the request raises an 'unknown collation' error. If fallback is yes or absent, the implementation makes a best-effort attempt.
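The decision described above can be summarised in a short Python sketch. The exception class and function name are hypothetical stand-ins, and the uca_available flag is an assumed simplification of "Saxon-PE/EE with the ICU features loaded".

```python
class UnknownCollationError(Exception):
    """Hypothetical stand-in for the 'unknown collation' error."""


def choose_implementation(params, uca_available):
    """Decide how a UCA collation request is served.

    params: dict of query keywords to values (all strings).
    uca_available: assumed flag, True when full UCA support is present.
    """
    if uca_available:
        return "uca"
    # fallback defaults to yes when the keyword is absent
    if params.get("fallback", "yes") == "no":
        raise UnknownCollationError("UCA is not available and fallback=no")
    return "fallback"


print(choose_implementation({"strength": "secondary"}, uca_available=False))
```

The sketch checks the fallback keyword only when full UCA support is unavailable, since that is the only case this section describes.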
For Saxon-HE, or in the absence of ICU, this involves building a tailored collation based on the Java library/Unicode implementation as described in Implementing a collating sequence, with the following re-mapping of parameters:
keyword(s) | values | effect |
lang | language/locale code | Use an appropriate collation for the requested locale from the Java environment, if available, else codepoint collation. |
strength | primary, secondary, tertiary or identical | used as stated. |
strength | 1, 2 or 3 | remapped to |
strength | quaternary, 4 or 5 | remapped to |
caseFirst | upper or lower | remapped to |
numeric | yes or no | remapped to |
version, alternate, backwards, normalization, caseLevel, reorder | - | All ignored. |
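The table can be read as a three-way classification of each parameter under fallback. The sketch below records only that classification; it deliberately does not state what the remapped parameters are remapped to, because the targets are not given here. The function name is hypothetical, and keywords that the table does not list (such as fallback itself) are rejected as an assumption of this sketch.

```python
# Strength values that the fallback table marks as "used as stated".
STRENGTH_USED_AS_STATED = {"primary", "secondary", "tertiary", "identical"}
# Strength values that the fallback table marks as remapped.
STRENGTH_REMAPPED = {"1", "2", "3", "quaternary", "4", "5"}
# Keywords that the fallback table marks as ignored.
IGNORED_KEYWORDS = {
    "version", "alternate", "backwards",
    "normalization", "caseLevel", "reorder",
}


def classify_fallback_param(keyword, value):
    """Return 'used', 'remapped' or 'ignored' for one keyword=value pair."""
    if keyword == "lang":
        return "used"
    if keyword == "strength":
        if value in STRENGTH_USED_AS_STATED:
            return "used"
        if value in STRENGTH_REMAPPED:
            return "remapped"
        raise ValueError(f"invalid strength: {value}")
    if keyword in ("caseFirst", "numeric"):
        return "remapped"
    if keyword in IGNORED_KEYWORDS:
        return "ignored"
    # Assumption: anything the table does not list is reported, not guessed.
    raise ValueError(f"keyword not covered by the fallback table: {keyword}")


for pair in [("lang", "de"), ("strength", "2"), ("reorder", "digit,space")]:
    print(pair, classify_fallback_param(*pair))
```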