Unicode Collation Algorithm

This section provides more detail on Saxon's support for the Unicode Collation Algorithm.

The Unicode Collation Algorithm is implemented using different libraries on different platforms, with differing levels of conformance:

On SaxonJ-HE, Saxon uses the collation facilities available directly from the JDK.
On SaxonJ-PE and -EE, Saxon uses the collation facilities available from the ICU-J library.
On SaxonCS, Saxon uses the collation facilities available from the ICU-N library, which is a port to C# of a subset of ICU-J.

To request this feature, use a collation URI consisting of the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part. The collation can be specified as a collation attribute on an xsl:sort instruction, as an xsl:default-collation attribute within an XSLT tree, or as the $collation argument of a comparison function such as fn:compare() or fn:deep-equal().

The query is a semicolon-separated sequence of zero or more keyword=value pairs, e.g. ...UCA?reorder=digit,space;strength=secondary. Full details of the query format and parameters can be found in The Unicode Collation Algorithm section of the specification. This section discusses the Saxon implementation.
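As an illustration of the query format described above, the following is a minimal sketch that splits a UCA collation URI into its keyword=value pairs. The class `UcaQueryParser` and its `parse` method are hypothetical names invented for this example; they are not part of the Saxon API, and the sketch does not validate keywords or values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Saxon class): splits a UCA collation URI
// into its semicolon-separated keyword=value parameters.
class UcaQueryParser {

    static final String UCA_BASE = "http://www.w3.org/2013/collation/UCA";

    static Map<String, String> parse(String collationUri) {
        int q = collationUri.indexOf('?');
        String base = (q < 0) ? collationUri : collationUri.substring(0, q);
        if (!UCA_BASE.equals(base)) {
            throw new IllegalArgumentException("Not a UCA collation URI: " + collationUri);
        }
        // LinkedHashMap keeps the parameters in the order they were written.
        Map<String, String> params = new LinkedHashMap<>();
        if (q < 0) {
            return params; // the query part is optional
        }
        for (String pair : collationUri.substring(q + 1).split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate empty segments such as a trailing ';'
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected keyword=value, found: " + pair);
            }
            params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse(UCA_BASE + "?reorder=digit,space;strength=secondary");
        params.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Note that the comma inside the `reorder` value is part of the value: only the semicolon separates parameters.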

Saxon supports the W3C-defined parameters as follows:

keyword	values	default	Notes
fallback	yes \| no	yes	`fallback=no` will raise errors in the case of unknown parameter keywords or values. Otherwise erroneous parameters are ignored.
lang	any value allowed for `xml:lang`, for example `en-US` for US English, or `sr-Cyrl-ME` for Cyrillic script Serbian in Montenegro	From the locale	The implementation uses an appropriate collation for the requested locale from the ICU environment, splitting the `lang` parameter into three possible subcomponents: `language-country-variant`. The locale used may affect the default values of other parameters; see `backwards` for an example. For a list of locales supported in this implementation see UCA-supported locales.
version	string	6.2.0.0	The version of the UCA to be used. Interpreted as an ascending sequence of major.minor.update... version numbers. Requests for versions less than or equal to the current supported version are processed with the current version. Requests for a higher version raise an error.
strength	primary \| secondary \| tertiary \| quaternary \| identical, or 1 \| 2 \| 3 \| 4 \| 5 as synonyms	tertiary, but see notes	Default strength may be altered by a specific locale.
maxVariable	space \| punct \| symbol \| currency	punct	Determines which characters are considered as "noise" for the purposes of the `alternate` parameter. The default value `punct` causes whitespace and punctuation to be treated as noise characters. (Note that this includes characters that are obviously punctuation, like full-stop, comma, and parentheses, while excluding symbols such as the plus sign, equals sign, and copyright sign. However, - (hyphen), #, &, %, and * are also classed as punctuation.)
alternate	non-ignorable \| shifted \| blanked	non-ignorable	This (poorly named) property controls the handling of "noise" characters such as spaces and punctuation. More specifically, it controls the handling of characters up to the value of `maxVariable`. For example if `maxVariable=punct` then it affects handling of whitespace and punctuation, while if `maxVariable=currency` then it also affects the handling of currency symbols. The value `non-ignorable` causes noise characters to be treated as first-class characters in their own right. The value `shifted` indicates that noise characters are treated as a quaternary distinction between strings (less significant than differences in accents or case), while `blanked` indicates that they are used only to distinguish strings that would otherwise be considered identical. The value `blanked` is not supported directly in the ICU library; if requested, it is handled by requesting `alternate=shifted` with `strength=tertiary`.
backwards	yes \| no	no, but see notes	This is principally used for backwards-order comparison of (French) accents at secondary strength, and the default may be set by the locale used. For example `lang=fr-CA` implies a default of `backwards=yes` whereas `lang=fr` defaults to `backwards=no`.
normalization	yes \| no	no	`normalization=yes` has not been tested. See ICU documentation for further details.
caseLevel	yes \| no	no	As specified by W3C.
caseFirst	upper \| lower	See notes	The default is to ignore case preferences.
numeric	yes \| no	no	As specified by W3C.
reorder	a comma-separated sequence of reorder codes, where a reorder code is one of `space, punct, symbol, currency, digit`, or a four-letter script code		As specified by W3C. Saxon testing revealed a bug in the ICU library which has been reported, but is not fixed at the time of writing.
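To make the `strength` levels in the table above concrete, here is a small sketch using the JDK's `java.text.Collator`, which is the facility the text says SaxonJ-HE relies on. It demonstrates JDK behaviour directly and is not Saxon's own code path; the class name `StrengthDemo` is invented for the example, and results under ICU may differ in detail.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative only: shows how collation strength changes which
// differences are significant, using the JDK collator for English.
class StrengthDemo {

    // Returns -1, 0 or 1 for the comparison of a and b at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        String accented = "\u00e9"; // e with acute accent

        // Primary: only base letters matter, so accents and case are ignored.
        System.out.println("primary   e vs e-acute: " + compareAt(Collator.PRIMARY, "e", accented));
        System.out.println("primary   a vs A:       " + compareAt(Collator.PRIMARY, "a", "A"));

        // Secondary: accents become significant, case is still ignored.
        System.out.println("secondary e vs e-acute: " + compareAt(Collator.SECONDARY, "e", accented));
        System.out.println("secondary a vs A:       " + compareAt(Collator.SECONDARY, "a", "A"));

        // Tertiary: case becomes significant as well.
        System.out.println("tertiary  a vs A:       " + compareAt(Collator.TERTIARY, "a", "A"));
    }
}
```

The JDK collator has no quaternary level, which is consistent with the remapping of `quaternary` described under UCA fallback behaviour.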

Additional Saxon parameters

In addition to the standard parameters, Saxon supports some further parameters of its own:

keyword	values	effect
class	fully-qualified Java class name of a class that implements `java.util.Comparator`.	This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as `contains()` and `starts-with()`, this class must also be a `java.text.RuleBasedCollator`. This approach allows a user-defined collation to be implemented in Java. This option is also available on the .NET platform, but the class must implement the Java interface `java.util.Comparator`.
rules	details of the ordering required, using the syntax of the Java `RuleBasedCollator`	This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of `RuleBasedCollator`.
ignore-case	yes \| no	Indicates whether the case of letters should be ignored: equivalent to `strength=secondary`.
ignore-modifiers	yes \| no	Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. Equivalent to `strength=primary`.
ignore-symbols	yes \| no	Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default.
ignore-width	yes \| no	Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary.
decomposition	none \| standard \| full	Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform.
alphanumeric	yes \| no \| codepoint	If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as `contains()` and `substring-before()`.
case-order	upper-first \| lower-first	Indicates whether upper case letters collate before or after lower case letters.
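The `alphanumeric` parameter described above can be sketched as a plain Java comparator. This is an illustration of the splitting and pairwise comparison rules as stated in the table, not Saxon's implementation; the class name `AlphanumericSketch` and its methods are invented for the example. The `withStringOrder()` factory uses `String.compareTo`, which orders by UTF-16 code unit and therefore only approximates the `codepoint` option for characters outside the Basic Multilingual Plane.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of alphanumeric ("natural") ordering.
class AlphanumericSketch implements Comparator<String> {

    private final Comparator<String> alphaComparator;

    AlphanumericSketch(Comparator<String> alphaComparator) {
        this.alphaComparator = alphaComparator;
    }

    // Assumption of this sketch: alphabetic parts use String.compareTo.
    static AlphanumericSketch withStringOrder() {
        return new AlphanumericSketch(Comparator.<String>naturalOrder());
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Splits into alternating parts: even positions are alphabetic (possibly
    // empty), odd positions are runs of ASCII digits.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i)); // alphabetic part, may be zero-length
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) i++;
            if (i > start) parts.add(s.substring(start, i)); // numeric part
        }
        return parts;
    }

    @Override
    public int compare(String a, String b) {
        List<String> pa = split(a);
        List<String> pb = split(b);
        int n = Math.min(pa.size(), pb.size());
        for (int k = 0; k < n; k++) {
            int c = (k % 2 == 0)
                    ? alphaComparator.compare(pa.get(k), pb.get(k))
                    : new BigInteger(pa.get(k)).compareTo(new BigInteger(pb.get(k)));
            if (c != 0) return c;
        }
        // All shared parts are equal: the string with fewer parts sorts first.
        return Integer.compare(pa.size(), pb.size());
    }

    public static void main(String[] args) {
        List<String> dates = new ArrayList<>(List.of("AD1066", "AD985", "AD79"));
        dates.sort(withStringOrder());
        System.out.println(dates);
    }
}
```

With plain string comparison, "AD1066" would sort before "AD985" because the character 1 precedes 9; comparing the numeric parts by value reverses that.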

UCA fallback behaviour

The specification supports fallback behaviour for cases where UCA is not implemented, or only partially implemented. If the query contains the parameter fallback=no and UCA is unsupported (as is the case with Saxon-HE, or when ICU features are not loaded), then the request raises an 'unknown collation' error. If fallback is yes or absent, the implementation makes a best-effort attempt to honour the request.

For Saxon-HE, or in the absence of ICU, this involves building a tailored collation based on the Java library/Unicode implementation, with the following re-mapping of parameters:

keyword(s)	values	effect
lang	language/locale code	Use an appropriate collation for the requested locale from the Java environment, if available, else codepoint collation.
strength	primary \| secondary \| tertiary \| identical	used as stated.
	1 \| 2 \| 3	remapped to `primary \| secondary \| tertiary` respectively.
	quaternary \| 4 \| 5	remapped to `identical`.
caseFirst	upper \| lower	remapped to `case-order=upper-first` and `case-order=lower-first` respectively.
numeric	yes \| no	remapped to `alphanumeric=yes` and `alphanumeric=no` respectively.
version, alternate, backwards, normalization, caseLevel, reorder	-	All ignored.
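The remapping in the table above can be expressed as a small function from UCA parameters to the Saxon-specific parameters. The following is a sketch only: `FallbackRemapSketch` and its methods are hypothetical names, not Saxon classes, and the choice to drop every keyword not listed in the table is an assumption of the example rather than documented behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the fallback parameter remapping table.
class FallbackRemapSketch {

    static Map<String, String> remap(Map<String, String> ucaParams) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : ucaParams.entrySet()) {
            String value = e.getValue();
            switch (e.getKey()) {
                case "lang":
                    out.put("lang", value); // locale lookup is left to the Java environment
                    break;
                case "strength":
                    out.put("strength", remapStrength(value));
                    break;
                case "caseFirst":
                    out.put("case-order", value + "-first"); // upper -> upper-first, lower -> lower-first
                    break;
                case "numeric":
                    out.put("alphanumeric", value); // yes -> yes, no -> no
                    break;
                default:
                    // version, alternate, backwards, normalization, caseLevel and
                    // reorder are ignored; dropping anything else is an assumption.
                    break;
            }
        }
        return out;
    }

    static String remapStrength(String value) {
        switch (value) {
            case "1": return "primary";
            case "2": return "secondary";
            case "3": return "tertiary";
            case "quaternary":
            case "4":
            case "5":
                return "identical";
            default:
                return value; // primary, secondary, tertiary, identical: used as stated
        }
    }

    public static void main(String[] args) {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("lang", "de");
        in.put("strength", "4");
        in.put("numeric", "yes");
        in.put("reorder", "digit,space");
        System.out.println(remap(in));
    }
}
```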