Unicode Collation Algorithm

XSL Transformations (XSLT) Version 3.0 and XQuery/XPath 3.1 support the use of the Unicode Collation Algorithm (UCA) for comparing strings in a variety of locales, with extensive parametric control. To request this feature, use a collation URI consisting of the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part. The collation can be specified as a collation attribute on an xsl:sort instruction, as an xsl:default-collation attribute within an XSLT tree, or as the $collation argument of a comparison function such as fn:compare() or fn:deep-equal().

The query is a semicolon-separated sequence of zero or more keyword=value pairs, e.g. ...UCA?reorder=digit,space;strength=secondary. Full details of the query format and parameters can be found in The Unicode Collation Algorithm section of the specification. This section discusses the Saxon implementation.
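As an illustration of the query format only, the following Python sketch splits a UCA collation URI into its keyword=value pairs and rebuilds it. The function names parse_uca_uri and build_uca_uri are hypothetical, and the sketch does not validate keywords or values; it is not part of Saxon.

```python
from urllib.parse import urlsplit

UCA_BASE = "http://www.w3.org/2013/collation/UCA"


def parse_uca_uri(uri):
    """Split a UCA collation URI into a dict of keyword -> value (hypothetical helper)."""
    parts = urlsplit(uri)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if base != UCA_BASE:
        raise ValueError(f"not a UCA collation URI: {uri}")
    params = {}
    # Pairs are separated by semicolons; empty segments are skipped.
    for pair in filter(None, parts.query.split(";")):
        keyword, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected keyword=value, got {pair!r}")
        params[keyword] = value
    return params


def build_uca_uri(params):
    """Join keyword=value pairs with semicolons and append them to the UCA base URI."""
    query = ";".join(f"{k}={v}" for k, v in params.items())
    return f"{UCA_BASE}?{query}" if query else UCA_BASE
```

A URI with no query part yields an empty dict, which corresponds to a sequence of zero pairs.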

Full UCA support is provided only in Saxon-PE/EE, from version 9.6 onwards; Saxon-HE uses the fallback behaviour described below.

Saxon-PE/EE uses the features of ICU (International Components for Unicode) to support UCA. More detailed information is available from the ICU site.

The ICU features require a sizeable (~7 MB) library, which may be supplied either in the main JAR file or as a separate JAR. The separate JAR can be either a 'minimised' version in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site. If the ICU features have not been loaded within Saxon-PE/EE, the fallback behaviour described below is used.

The parameters and their behaviour in the Saxon-PE/EE implementation are as follows:

keyword

values

default

Notes

fallback

yes | no

yes

fallback=no raises an error for unknown parameter keywords or values. Otherwise, erroneous parameters are ignored.

lang

any value allowed for xml:lang, for example en-US for US English, or sr-Cyrl-ME for Cyrillic-script Serbian in Montenegro

From the locale

The implementation uses an appropriate collation for the requested locale from the ICU environment, splitting the lang parameter into three possible subcomponents: language-country-variant. The locale used may affect the default values of other parameters; see backwards for an example.

For a list of locales supported in this implementation see UCA-supported locales

version

string

6.2.0.0

The version of the UCA to be used. Interpreted as an ascending sequence of major.minor.update... version numbers. Requests for versions less than or equal to the current supported version are processed with the current version. Requests for a higher version raise an error.

strength

primary | secondary | tertiary | quaternary | identical, or 1 | 2 | 3 | 4 | 5 as synonyms

tertiary, but see notes

Default strength may be altered by a specific locale.

maxVariable

space | punct | symbol | currency

punct

Determines which characters are considered as "noise" for the purposes of the alternate parameter. The default value punct causes whitespace and punctuation to be treated as noise characters. (Note that this includes characters that are obviously punctuation, like full-stop, comma, and parentheses, while excluding symbols such as the plus sign, equals sign, and copyright sign. However, - (hyphen), #, &, %, and * are classed as punctuation.)

alternate

non-ignorable | shifted | blanked

non-ignorable

This (poorly named) property controls the handling of "noise" characters such as spaces and punctuation. More specifically, it controls the handling of characters up to the value of maxVariable. For example if maxVariable=punct then it affects handling of whitespace and punctuation, while if maxVariable=currency then it also affects the handling of currency symbols. The value non-ignorable causes noise characters to be treated as first-class characters in their own right. The value shifted indicates that noise characters are treated as a quaternary distinction between strings (less significant than differences in accents or case), while blanked indicates that they are used only to distinguish strings that would otherwise be considered identical. The value blanked is not supported directly in the ICU library; if requested, it is handled by requesting alternate=shifted with strength=tertiary.

backwards

yes | no

no, but see notes

This is principally used for backwards-order comparison of (French) accents at secondary strength, and the default may be set by the locale used. For example, lang=fr-CA implies a default of backwards=yes, whereas lang=fr defaults to backwards=no.

normalization

yes | no

no

normalization=yes has not been tested. See ICU documentation for further details.

caseLevel

yes | no

no

As specified by W3C.

caseFirst

upper | lower

See notes

The default is to ignore case preferences.

numeric

yes | no

no

As specified by W3C.

reorder

a comma-separated sequence of reorder codes, where a reorder code is one of space, punct, symbol, currency, digit, or a four-letter script code

As specified by W3C. Saxon testing revealed a bug in the ICU library which has been reported, but is not fixed at the time of writing.
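Three of the rules in the table above (the numeric synonyms for strength, the handling of version requests, and the substitution made for alternate=blanked) can be sketched in Python as follows. This is a minimal illustration, not Saxon's code: the function names are hypothetical, and SUPPORTED_VERSION assumes that the default shown in the table (6.2.0.0) is also the current supported version.

```python
# Assumption: the table default is also the current supported UCA version.
SUPPORTED_VERSION = "6.2.0.0"
STRENGTHS = ["primary", "secondary", "tertiary", "quaternary", "identical"]


def _version_key(text):
    """Turn 'major.minor.update...' into a tuple, ignoring trailing zeros."""
    numbers = [int(part) for part in text.split(".")]
    while numbers and numbers[-1] == 0:
        numbers.pop()
    return tuple(numbers)


def resolve_version(requested):
    """Lower or equal versions are served by the current one; higher ones fail."""
    if _version_key(requested) > _version_key(SUPPORTED_VERSION):
        raise ValueError(f"UCA version {requested} is not supported")
    return SUPPORTED_VERSION


def normalize_strength(value):
    """Accept a strength name, or 1..5 as a synonym for it."""
    if value in STRENGTHS:
        return value
    if value in {"1", "2", "3", "4", "5"}:
        return STRENGTHS[int(value) - 1]
    raise ValueError(f"unknown strength: {value!r}")


def apply_alternate(params):
    """Replace alternate=blanked with alternate=shifted and strength=tertiary."""
    result = dict(params)
    if result.get("alternate") == "blanked":
        result["alternate"] = "shifted"
        result["strength"] = "tertiary"
    return result
```

Trailing zeros are stripped before comparison so that a request for 6.2 is treated as equal to 6.2.0.0 rather than lower or higher; that choice is an assumption of the sketch.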

UCA Fallback Behaviour

The specification supports fallback behaviour for cases where UCA is not implemented, or is only partially implemented. If the query contains the parameter fallback=no and UCA is unsupported (as is the case with Saxon-HE, or when the ICU features are not loaded), the request raises an 'unknown collation' error. If fallback is yes or absent, the implementation makes a best-effort attempt.
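The decision described above can be summarised in a short Python sketch. The names choose_implementation and UnknownCollationError are hypothetical, and uca_available stands in for "running Saxon-PE/EE with the ICU features loaded"; the sketch shows the rule, not Saxon's internals.

```python
class UnknownCollationError(Exception):
    """Stands in for the 'unknown collation' error (hypothetical name)."""


def choose_implementation(params, uca_available):
    """Return 'uca' or 'fallback', or raise if fallback has been disallowed."""
    if uca_available:
        return "uca"
    # fallback defaults to yes when the parameter is absent.
    if params.get("fallback", "yes") == "no":
        raise UnknownCollationError("UCA is not supported and fallback=no")
    return "fallback"
```

For example, choose_implementation({"lang": "fr"}, uca_available=False) selects the fallback path, while the same call with fallback set to no raises the error.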

For Saxon-HE, or in the absence of ICU, this involves building a tailored collation based on the Java library/Unicode implementation as described in Implementing a collating sequence, with the following re-mapping of parameters:

keyword(s)

values

effect

lang

language/locale code

Use an appropriate collation for the requested locale from the Java environment, if available, else codepoint collation.

strength

primary | secondary | tertiary | identical

used as stated.

1 | 2 | 3

remapped to primary | secondary | tertiary respectively.

quaternary | 4 | 5

remapped to identical.

caseFirst

upper | lower

remapped to case-order=upper-first and case-order=lower-first respectively.

numeric

yes | no

remapped to alphanumeric=yes and alphanumeric=no respectively.

version, alternate, backwards, normalization, caseLevel, reorder

-

All ignored.
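The re-mapping in the table above can be expressed as a small Python function. This is a sketch under stated assumptions: remap_for_fallback is a hypothetical name, the output keys are the parameter names given in the table (strength, case-order, alphanumeric), and unrecognised keywords or values are simply dropped.

```python
# Strength values and synonyms mapped as in the table above.
STRENGTH_MAP = {
    "primary": "primary", "secondary": "secondary",
    "tertiary": "tertiary", "identical": "identical",
    "1": "primary", "2": "secondary", "3": "tertiary",
    "quaternary": "identical", "4": "identical", "5": "identical",
}


def remap_for_fallback(params):
    """Translate UCA query parameters into their fallback equivalents."""
    result = {}
    for keyword, value in params.items():
        if keyword == "lang":
            result["lang"] = value
        elif keyword == "strength" and value in STRENGTH_MAP:
            result["strength"] = STRENGTH_MAP[value]
        elif keyword == "caseFirst" and value in ("upper", "lower"):
            result["case-order"] = f"{value}-first"
        elif keyword == "numeric" and value in ("yes", "no"):
            result["alphanumeric"] = value
        # version, alternate, backwards, normalization, caseLevel and
        # reorder have no fallback equivalent and are ignored.
    return result
```

For instance, strength=4 together with caseFirst=upper becomes strength=identical and case-order=upper-first, while a reorder parameter in the same query is discarded.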