Implementing a collating sequence
By default, Saxon allows a collation URI of the form http://saxon.sf.net/collation?keyword=value;keyword=value;... in which the query parameters can be separated by either ampersands or semicolons; semicolons are usually more convenient.
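As an illustration of the URI form described above, the following sketch assembles such a URI from keyword/value pairs using semicolons as separators. The class and method names are invented for this example and are not part of Saxon; no escaping is attempted, so the values are assumed to be URI-safe. One likely reason semicolons are more convenient is that an ampersand would have to be written as &amp; whenever the URI appears inside an XML document such as a stylesheet.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CollationUriExample {

    // Base of the collation URI form described in the text.
    public static final String BASE = "http://saxon.sf.net/collation";

    // Joins keyword=value pairs with semicolons, in insertion order.
    // Assumption: keywords and values need no URI escaping.
    public static String buildCollationUri(Map<String, String> params) {
        StringJoiner query = new StringJoiner(";");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return BASE + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("lang", "de");
        params.put("ignore-case", "yes");
        System.out.println(buildCollationUri(params));
    }
}
```

The resulting string can then be used wherever a collation URI is expected, for example in the collation argument of a comparison function.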
The same keywords are available on the Java and .NET platforms, but because collation support differs between the two platforms, the keywords may interact in slightly different ways, and the same collation URI may produce different sort orders on the two platforms. (One noteworthy difference is that the Java collations treat spaces as significant, whereas the .NET collations do not.)
The keywords available in such a collation URI are the same as in the configuration file, and are as follows:
keyword | values | effect |
class | fully-qualified Java class name of a class that implements | This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as |
rules | details of the ordering required, using the syntax of the Java | This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of |
lang | any value allowed for | This is used to find the collation appropriate to a Java locale or .NET culture. The collation may be further tailored using the parameters described below. |
ignore-case | yes / no | Indicates whether the upper and lower case letters are considered equivalent. Note that even when ignore-case is set to "no", case is less significant than the actual letter value, so that "XPath" and "Xpath" will appear next to each other in the sorted sequence. On the Java platform, setting ignore-case sets the collation strength to secondary. |
ignore-modifiers | yes / no | Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. On the Java platform, setting ignore-modifiers sets the collation strength to primary. |
ignore-symbols | yes / no | Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default. |
ignore-width | yes / no | Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary. |
strength | primary / secondary / tertiary / identical | Indicates the differences that are considered significant when comparing two strings. A/B is a primary difference; a/ä is a secondary difference; A/a is a tertiary difference (though this varies by language). So with strength=primary both A=a and a=ä are true; with strength=secondary A=a is true but a=ä is false; with strength=tertiary A=a is false. This option should not be combined with the ignore-XXX options. The setting "primary" is equivalent to ignoring case, modifiers, and width; "secondary" is equivalent to ignoring case and width; "tertiary" ignores width only; and "identical" ignores nothing. |
decomposition | none / standard / full | Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform. |
alphanumeric | yes / no / codepoint | If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as |
case-order | upper-first / lower-first | Indicates whether upper case letters collate before or after lower case letters. |
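The alphanumeric option in the table above describes a split-and-compare procedure that can be sketched in plain Java. This is only an illustration of the described behaviour, not Saxon's implementation: the class and method names are invented, the comparator passed in stands for "the collation implied by the other query parameters", and the tie-breaking rule (the string with fewer parts sorts first) is an assumption of this sketch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class AlphanumericSketch {

    private static boolean isAsciiDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    // Splits s into alternating parts: alphabetic, numeric, alphabetic, ...
    // Even positions hold alphabetic parts (possibly zero-length),
    // odd positions hold runs of ASCII digits.
    public static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i)); // alphabetic part, may be empty
            if (i == s.length()) break;
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i)); // numeric part
        }
        return parts;
    }

    // Compares parts pairwise: alphabetic parts with alphaOrder,
    // numeric parts by numeric value.
    public static Comparator<String> alphanumeric(Comparator<String> alphaOrder) {
        return (a, b) -> {
            List<String> pa = split(a);
            List<String> pb = split(b);
            int n = Math.min(pa.size(), pb.size());
            for (int k = 0; k < n; k++) {
                int c = (k % 2 == 0)
                        ? alphaOrder.compare(pa.get(k), pb.get(k))
                        : new BigInteger(pa.get(k)).compareTo(new BigInteger(pb.get(k)));
                if (c != 0) return c;
            }
            // Assumption of this sketch: fewer parts sorts first.
            return Integer.compare(pa.size(), pb.size());
        };
    }

    public static void main(String[] args) {
        List<String> years = new ArrayList<>(Arrays.asList("AD1066", "AD985", "AD99"));
        years.sort(alphanumeric(Comparator.<String>naturalOrder()));
        System.out.println(years); // [AD99, AD985, AD1066]
    }
}
```

Passing Comparator.naturalOrder() for the alphabetic parts loosely resembles the "codepoint" setting; a java.text.Collator could be wrapped and passed instead to approximate the "yes" setting.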
This format of URI, http://saxon.sf.net/collation?keyword=value;keyword=value;..., is handled by Saxon's default CollationURIResolver. It is possible to replace or supplement this mechanism by registering a user-written CollationURIResolver. This must be an implementation of the Java interface net.sf.saxon.lib.CollationURIResolver, which only requires a single method, resolve(), to be implemented. The result of the method is in general a Java Comparator, though if the collation is to be used in functions such as contains() which match parts of a string rather than the whole string, then the result must also be an instance of either java.text.RuleBasedCollator, or of the Saxon interface net.sf.saxon.lib.SubstringMatcher.
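The general shape of such a resolver, mapping a URI to a Comparator, can be sketched as follows. To keep the example self-contained, it uses a hypothetical stand-in interface, SimpleCollationResolver, rather than the real net.sf.saxon.lib.CollationURIResolver, whose exact resolve() parameters are not given here and should be taken from the Saxon Javadoc for the release in use. The example URI and the convention of returning null for an unrecognized URI are likewise assumptions of this sketch.

```java
import java.util.Comparator;

public class ResolverSketch {

    // Hypothetical stand-in for the resolver interface described in the text.
    // It is NOT the Saxon interface; the real resolve() signature may differ.
    public interface SimpleCollationResolver {
        Comparator<String> resolve(String collationUri);
    }

    // Hypothetical collation URI, invented for this example.
    public static final String CASELESS_URI = "http://example.com/collation/caseless";

    // Recognizes one URI and maps it to a case-insensitive Comparator.
    public static final SimpleCollationResolver RESOLVER = uri -> {
        if (CASELESS_URI.equals(uri)) {
            return String.CASE_INSENSITIVE_ORDER;
        }
        // Assumption: null signals "not recognized", so that another
        // mechanism can be tried by the caller.
        return null;
    };

    public static void main(String[] args) {
        Comparator<String> c = RESOLVER.resolve(CASELESS_URI);
        System.out.println(c.compare("XPath", "xpath") == 0); // true
    }
}
```

A plain Comparator like this one would only suit whole-string comparison; as the text notes, functions such as contains() need a result that is also a java.text.RuleBasedCollator or a SubstringMatcher.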
In the Java API, a user-written CollationURIResolver is registered with the Configuration object, either directly or, in the case of XSLT, by using the JAXP setAttribute() method on the TransformerFactory (the relevant property name is FeatureKeys.COLLATION_URI_RESOLVER). This applies to all stylesheets and queries compiled and executed under that configuration.
It is also possible to register a collation (for example as an instance of the Java class Collator or Comparator) with the Configuration. Such explicitly registered collations (together with those registered via the configuration file) are used before calling the CollationURIResolver. In addition, the APIs provided for executing XPath and XQuery expressions allow named collations to be registered by the calling application, as part of the static context.
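For illustration, a java.text.Collator of the kind that might be registered this way can be prepared using only the JDK. The sketch below builds collators of different strengths for the English locale; the class and helper names are invented for this example, and the call that registers the collator with the Configuration is deliberately omitted because its signature is not given here.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthSketch {

    // Returns an English-locale collator with the requested strength,
    // e.g. Collator.PRIMARY, Collator.SECONDARY or Collator.TERTIARY.
    public static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    public static void main(String[] args) {
        Collator primary = collator(Collator.PRIMARY);
        Collator tertiary = collator(Collator.TERTIARY);
        // Case is ignored at primary strength but significant at tertiary.
        System.out.println(primary.compare("XPath", "xpath") == 0);
        System.out.println(tertiary.compare("XPath", "xpath") == 0);
    }
}
```

Because Collator implements Comparator, the same object can be used directly for sorting in ordinary Java code, which is a convenient way to check its behaviour before registering it.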
At present there are no equivalent facilities in the .NET API (other than the use of the configuration file), though it is possible to manipulate collations by dropping down into the Java interface.