Sorting and collations
Different countries (or languages) have different rules for sorting strings into alphabetical order. For example, in German "Ä" comes between "A" and "B", while in Swedish it comes after "Z". (And the rules are a lot more complicated than this, because diacritical marks are ignored unless all the letters in the word are identical.)
In addition, two strings such as ("ALPHA", "alpha"), or ("Jäger", "Jaeger") may or may not be considered to match when comparing for equality. In this case the rules depend less on the language involved, and more on the requirements of the application.
All operations in XPath, XSLT, and XQuery that depend on comparing or ordering strings therefore allow a collation to be specified. A collation is simply a rule for deciding whether two strings are equal, and if not, which one sorts first. Collations are identified using a URI.
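As an illustration of "a rule for deciding whether two strings are equal", the following minimal Java sketch uses the JDK class java.text.Collator rather than any Saxon API. It shows the same pair of strings matching under one rule and not under another. The class name CollationStrengthSketch and the choice of the English locale are assumptions made for this example only; nothing here describes how Saxon implements collations internally.

```java
import java.text.Collator;
import java.util.Locale;

// Illustration only: uses the JDK collator, not the Saxon API.
class CollationStrengthSketch {

    // Builds a JDK collator for English with the requested strength.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        return c;
    }

    // True if the two strings compare as equal at the given strength.
    static boolean matches(String a, String b, int strength) {
        return collator(strength).compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences; TERTIARY does not.
        System.out.println(matches("ALPHA", "alpha", Collator.PRIMARY));
        System.out.println(matches("ALPHA", "alpha", Collator.TERTIARY));
    }
}
```

Whether a pair such as "ALPHA" and "alpha" should match is a decision for the application, which is why the strength is a parameter here rather than a fixed property of the language.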
A collation URI may be used as an argument to many of the standard functions, as an attribute of various instructions in XSLT (xsl:sort, xsl:for-each-group, xsl:merge-key), and in the order by clause of a FLWOR expression in XQuery.
In Saxon the default collation is always the "codepoint" collation. This collates strings based on the integer values assigned by Unicode to each character: for example "ah!" sorts before "ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63). This generally gives good results for artificial strings such as part numbers, vehicle registration marks, and file names, but it's inadequate for natural language text. The codepoint collation may be requested explicitly using the URI http://www.w3.org/2005/xpath-functions/collation/codepoint.
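The comparison described above can be sketched in a few lines of Java. This is an illustration of codepoint ordering in general, not Saxon's own code; the class name CodepointCollationSketch is hypothetical. Note that Java's built-in String.compareTo compares UTF-16 code units, which gives a different order from codepoint comparison for characters outside the Basic Multilingual Plane, so the sketch walks the strings codepoint by codepoint.

```java
// Illustration only: a hand-written codepoint comparison, not the Saxon API.
class CodepointCollationSketch {

    // Compares two strings by Unicode codepoint rather than by UTF-16 code unit.
    static int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other (or they are equal):
        // the one with characters left over sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        // 33 ('!') is less than 63 ('?'), so "ah!" sorts first.
        System.out.println(compare("ah!", "ah?")); // prints -1
    }
}
```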
The default collation may be changed for a portion of an XSLT stylesheet by use of an [xsl:]default-collation attribute on an enclosing element; and it can be changed for an XQuery module using the declare default collation declaration in the XQuery prolog. It can also be changed using the Saxon API, for example XsltCompiler.declareDefaultCollation(X); (SaxonJ) or XsltCompiler.DefaultCollationName = X; (SaxonCS).
Two more kinds of collation are defined in the W3C language specifications, and are recognized in all versions of Saxon (though there may be differences in the details of the output):
- The ASCII case-blind collation (URI http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive) uses codepoint comparison for most characters, but treats ASCII lower-case letters as equal to their upper-case equivalents. This collation is defined by the HTML5 standard, and it's an efficient way of matching ASCII keywords such as "ascending" and "descending", but it's not suitable for more general text. It's intended only for use in equality comparisons, not for sorting.
- The Unicode Collation Algorithm (UCA) represents a family of collations, with the exact rules depending on parameters that are supplied. A typical URI for a UCA collation would be http://www.w3.org/2013/collation/UCA?lang=sv;strength=primary where the parameters after the "?" indicate the detailed rules to be applied.
Further details on the Unicode Collation Algorithm are supplied below.
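The equality rule of the ASCII case-blind collation is simple enough to sketch directly. The Java below is a minimal illustration of that rule as described above, not Saxon's implementation; the class name AsciiCaseBlindSketch and its method names are hypothetical. Only the letters A to Z are folded, so non-ASCII letters such as "Ä" and "ä" remain distinct.

```java
// Illustration only: ASCII case-blind equality, not the Saxon API.
class AsciiCaseBlindSketch {

    // Maps ASCII 'A'-'Z' to 'a'-'z' and leaves every other character unchanged.
    static char fold(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // True if the strings are equal once ASCII letters are case-folded.
    static boolean matches(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (fold(a.charAt(i)) != fold(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("ASCENDING", "ascending")); // prints true
        // \u00C4 is "Ä" and \u00E4 is "ä": outside ASCII, so not folded.
        System.out.println(matches("\u00C4", "\u00E4"));       // prints false
    }
}
```

The sketch offers only an equality test and no ordering, in line with the note that this collation is intended for equality comparisons rather than sorting.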
Saxon also allows a collation to be supplied programmatically:
- On SaxonJ, use the method Processor.declareCollation() or Configuration.registerCollation().
- On SaxonCS, use the method Processor.DeclareCollation().
When the xsl:sort instruction is used without an explicit collation, Saxon attempts to construct a collation using the supplied values of attributes such as lang and case-order.
For backwards compatibility reasons the standard collation resolver in Saxon also accepts URIs in the form http://saxon.sf.net/collation followed by query parameters; the query parameters that are recognized are the same as those defined for W3C UCA collation URIs.
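To show how the query part of such a URI breaks down into parameters, here is a minimal Java sketch. It assumes the keyword=value pairs are separated by ";" as in the UCA example URI shown earlier, and that the http://saxon.sf.net/collation form is written the same way. The class CollationUriSketch and its parameters method are hypothetical helpers for illustration; they are not part of the Saxon API and do not validate parameter names or values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: splits a collation URI's query part into parameters.
class CollationUriSketch {

    // Returns the keyword=value pairs after "?", assuming ";" separators.
    static Map<String, String> parameters(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0 || q == uri.length() - 1) {
            return params; // no query part, so no parameters
        }
        for (String pair : uri.substring(q + 1).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String uca = "http://www.w3.org/2013/collation/UCA?lang=sv;strength=primary";
        System.out.println(parameters(uca)); // prints {lang=sv, strength=primary}
    }
}
```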