Numbers and Dates from ICU

International Components for Unicode (ICU) provides extensive facilities for localized numbering and date formatting, which are supported in Saxon-PE and -EE from version 9.6.

The ICU features require a sizeable library (approximately 7 MB), which may be supplied either in the main JAR file or as a separate JAR. The separate JAR can be either a 'minimised' version included in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site.

If the ICU features have not been loaded within Saxon-PE/EE, support for numbering and dates in Danish, Dutch, Flemish, French (and Belgian French), German, Italian and Swedish is still provided, as detailed in the table of Numberings for selected languages.

Numbering

ICU supports a large set of language-specific rulesets for different forms of spelled-out numbering and digit-ordinal treatment. For example:

| Language | Number | Ruleset | XPath | Result |
|---|---|---|---|---|
| en (English) | 1123 | %spellout-cardinal | format-integer(1123,'Ww','en-x-sc') | One Thousand One Hundred Twenty Three |
| en-US (English, US) | 242 | %spellout-ordinal-verbose | format-integer(242,'Ww','en-US-x-sov') | Two Hundred and Forty Second |
| cy (Welsh) | 132 | %spellout-cardinal-feminine | format-integer(132,'Ww','cy-x-scf') | Un Cant Tri Deg Dwy |
| es-PT (Spanish, Portugal) | 242 | %spellout-ordinal-feminine | format-integer(242,'Ww','es-PT-x-sof') | Ducentésima Cuadragésima Segunda |
| en (English) | 1123 | | format-integer(1123,'1;o','en') | 1123rd |
| en-US (English, US) | 242 | | format-integer(242,'1;o','en-US') | 242nd |
| cy (Welsh) | 132 | | format-integer(132,'1;o','cy') | 132. |
| es-PT (Spanish, Portugal) | 242 | | format-integer(242,'1;o','es-PT') | 242.º |

In all cases the numbering scheme names and their behaviour appear to be taken from the base language, with no regional variation; for example, es-PT and es-419 would produce the same results.

To invoke one of these schemes, a private-use subtag as defined by IETF BCP 47 is appended to the language tag, in the form -x-code. With a very small number of exceptions (to avoid clashes), each code is formed from the first letter of each word in the scheme name, all in lower case. Examples of use are shown in the previous table. These private tag codes are recognised for 'word' numbering purposes in the language argument of both format-integer() and xsl:number.
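The first-letter encoding can be illustrated with a short Python sketch. This is not Saxon's implementation: the function names `scheme_code` and `language_with_scheme` are hypothetical, and the sketch deliberately ignores the small number of exceptions mentioned above, which are listed in Supported ICU Numbering Schemes.

```python
def scheme_code(ruleset_name: str) -> str:
    """Abbreviate a ruleset name to the first letter of each word, in lower case.

    Hypothetical helper: the handful of exceptional codes is not modelled here.
    """
    words = ruleset_name.lstrip("%").split("-")
    return "".join(word[0] for word in words if word).lower()


def language_with_scheme(language: str, ruleset_name: str) -> str:
    """Append the private-use subtag (-x-code) that selects the given ruleset."""
    return f"{language}-x-{scheme_code(ruleset_name)}"


print(language_with_scheme("cy", "%spellout-cardinal-feminine"))  # cy-x-scf
```

The resulting string is what would be passed as the language argument, as in the 'cy-x-scf' row of the table.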

A full list of the scheme codes and their support in a given language is given in Supported ICU Numbering Schemes.

In the absence of such a private tag, the following strategies are adopted:

Cardinal spellout
When directed to generate a cardinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-cardinal-verbose, spellout-cardinal, spellout-cardinal-native, spellout-cardinal-neuter, spellout-cardinal-feminine, spellout-cardinal-masculine. It appears that within ICU all languages contain at least one of these schemes, but if not, any scheme whose name matches the regular expression ^%spellout-cardinal is used (choosing the first provided for the locale).
Ordinal spellout
When directed to generate an ordinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-ordinal-verbose, spellout-ordinal, spellout-ordinal-native, spellout-ordinal-neuter, spellout-ordinal-feminine, spellout-ordinal-masculine. In the absence of any of these schemes, any scheme whose name matches the regular expression ^%spellout-ordinal is used (choosing the first provided for the locale). In the case of there being no ordinal scheme available for the locale (or a language that does not have ordinals) the default cardinal scheme is used.
Ordinal digits
When producing an ordinal digit suffix (e.g. 13.º in Spanish), the 'digit-ordinal' ruleset is used by default - for those cases where there are specialist forms (e.g. Catalan and Spanish), the private tag must be set to get the specialist behaviour.
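
The spellout selection order described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than Saxon's code: `choose_spellout_scheme` is a hypothetical name, ruleset names are assumed to carry the leading '%' used in the table, and `available` is assumed to list the locale's rulesets in the order ICU provides them.

```python
import re

# Preference order from the text: verbose, plain, native, neuter, feminine, masculine.
PREFERRED_SUFFIXES = ["-verbose", "", "-native", "-neuter", "-feminine", "-masculine"]


def choose_spellout_scheme(available, kind):
    """Pick a spellout ruleset of the given kind ('cardinal' or 'ordinal').

    Returns None if the locale offers no suitable ruleset at all.
    """
    base = f"%spellout-{kind}"
    # 1. The first preferred scheme that the locale actually provides.
    for suffix in PREFERRED_SUFFIXES:
        if base + suffix in available:
            return base + suffix
    # 2. Otherwise the first ruleset whose name starts with the base name.
    pattern = re.compile("^" + re.escape(base))
    for name in available:
        if pattern.search(name):
            return name
    # 3. With no ordinal scheme at all, fall back to the default cardinal scheme.
    if kind == "ordinal":
        return choose_spellout_scheme(available, "cardinal")
    return None
```

For a locale offering only `%spellout-cardinal-feminine` and `%spellout-cardinal-masculine`, a request for ordinals would resolve to `%spellout-cardinal-feminine` under this sketch.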

English numbering

ICU renders 22 as words as "twenty-two", whereas the Saxon-HE default (not using ICU) is "twenty two", with a space rather than hyphen separator. At present the ICU result for all English schemes (cardinal and ordinal) is modified to use a space separator.

The default use of a '-verbose' scheme means that spellout of 118 yields "one hundred and eighteen", following the British usage rather than "one hundred eighteen", which is the (US) shortened form.

Both of these behaviours may eventually be configurable...
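
As a minimal illustration of the separator adjustment for English, the following Python sketch replaces the hyphens in an ICU-style result with spaces. The function name `normalise_english_spellout` is hypothetical, and the sketch assumes the adjustment is a plain character substitution, which the text above does not confirm.

```python
def normalise_english_spellout(icu_text: str) -> str:
    """Replace ICU's hyphen separators with spaces (assumed to be a simple substitution)."""
    return icu_text.replace("-", " ")


print(normalise_english_spellout("twenty-two"))  # twenty two
```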

Using format-picture options within spelled-out numbering

format-number() and format-integer() can support further implementation-dependent control of numbering through parameters attached to the cardinal (c) and ordinal (o) modifiers. For example, format-integer(1,'Ww;o(-er)','de') is intended to produce "Erster" in the recommended approach.

In Saxon-PE/EE, there are two forms of such parameterisation supported for spelled-out (i.e. W|w) formats:

For German (lang="de"), ICU does not provide case/gender-variable ordinal word numbering (i.e. only "Erste" and not "Erster"). The Saxon implementation supports the -suffix ordinal option described above, which replaces the trailing 'e' on the generated ordinal. Thus format-integer(1,'w;o(-en)','de') produces "ersten".
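
The suffix replacement for German can be sketched in Python as below. The helper names `ordinal_parameter` and `apply_german_suffix` are hypothetical, the picture parsing is a simplified regular expression rather than a full picture-string parser, and leaving words that do not end in 'e' unchanged is an assumption made for this sketch.

```python
import re


def ordinal_parameter(picture):
    """Return the parameter of the ordinal modifier, e.g. '-en' from 'w;o(-en)', else None."""
    match = re.search(r";o\(([^)]*)\)", picture)
    return match.group(1) if match else None


def apply_german_suffix(ordinal_word, picture):
    """Replace the trailing 'e' of a German ordinal word with the requested ending."""
    parameter = ordinal_parameter(picture)
    if not parameter or not parameter.startswith("-") or not ordinal_word.endswith("e"):
        return ordinal_word  # assumption: nothing to replace, leave the word as it is
    return ordinal_word[:-1] + parameter[1:]


print(apply_german_suffix("erste", "w;o(-en)"))   # ersten
print(apply_german_suffix("Erste", "Ww;o(-er)"))  # Erster
```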

Dates

ICU also provides facilities for localized date formatting, principally for names of days of the week and months, though a variety of calendars and epoch naming facilities are also available. In Saxon-PE/EE naming of months and days (through picture fields on format-date()) is localised through the local language in scope. These appear to be all based on the base language, with no regional variations.

As required by the specification, when the requested language is not implemented, the result of format-date() or format-dateTime() is prefixed with "[Language: default language code]".
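
The prefixing rule can be sketched in Python as follows. The function `mark_language_fallback` is hypothetical; it assumes that only the base language subtag is compared (in line with the observation that regional variants are not distinguished) and that the marker is placed directly before the formatted text with no separating space.

```python
def mark_language_fallback(formatted, requested_lang, used_lang):
    """Prefix the result with a language marker when a fallback language was used."""
    requested_base = requested_lang.split("-")[0].lower()
    used_base = used_lang.split("-")[0].lower()
    if requested_base == used_base:
        return formatted
    return f"[Language: {used_lang}]{formatted}"


print(mark_language_fallback("Monday", "xx", "en"))  # [Language: en]Monday
```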

Title Case

Conventions for title-casing sequences of words can differ markedly between languages, with many keeping strictly to lower case throughout, and a very few (such as Dutch with 'IJ') having very specialist rules. In Saxon, when title case is requested (e.g. Ww in format-integer() or [MNn] in format-date()) the Saxon implementation follows these rules:

Note that this may cause problems: for example, title case could be forced on a language that otherwise only ever uses a uniform case. If you discover issues in a language you are using, please let us know.