Numbers and Dates from ICU

International Components for Unicode (ICU) provides extensive facilities for localized numbering and date formatting, which are supported in Saxon-PE and Saxon-EE from version 9.6.

The ICU features require a sizeable (~7 MB) library, which may be supplied either in the main JAR file or as a separate JAR. The separate JAR can be either the 'minimised' version in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site.

If the ICU features have not been loaded within Saxon-PE/EE, support for numbering and dates for Danish, Dutch, Flemish, French (and Belgian French), German, Italian and Swedish is still provided, as detailed in the table of Numberings for selected languages.

Numbering

ICU supports a large set of language-specific rulesets for different forms of spelled-out numbering and digit-ordinal treatment. For example:

Language	Number	Ruleset	XPath	Result
`en` English	1123	%spellout-cardinal	`format-integer(1123,'Ww','en-x-sc')`	One Thousand One Hundred Twenty Three
`en-US` English (US)	242	%spellout-ordinal-verbose	`format-integer(242,'Ww','en-US-x-sov')`	Two Hundred and Forty Second
`cy` Welsh	132	%spellout-cardinal-feminine	`format-integer(132,'Ww','cy-x-scf')`	Un Cant Tri Deg Dwy
`es-PT` Spanish (Portugal)	242	%spellout-ordinal-feminine	`format-integer(242,'Ww','es-PT-x-sof')`	Ducentésima Cuadragésima Segunda
`en` English	1123		`format-integer(1123,'1;o','en')`	1123rd
`en-US` English (US)	242		`format-integer(242,'1;o','en-US')`	242nd
`cy` Welsh	132		`format-integer(132,'1;o','cy')`	132.
`es-PT` Spanish (Portugal)	242		`format-integer(242,'1;o','es-PT')`	242º

In all cases, the numbering scheme names and their behaviour appear to be taken from the base language, with no regional variation: for example, es-PT and es-419 would produce the same results.

To invoke one of these schemes, a private-use tag as defined by IETF BCP 47 is appended to the language tag, in the form -x-code. With a very small number of exceptions (to avoid clashes), each code is formed from the first letters of the words in the scheme name, all in lower case. Examples of use are shown in the table above. These private tag codes are recognised for 'word' numbering purposes in the language arguments of both format-integer() and xsl:number.

A full list of the scheme codes and their support in a given language is given in Supported ICU Numbering Schemes.
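The abbreviation rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not Saxon's code: the function names `scheme_code` and `tagged_language` are hypothetical, and the small number of exceptional codes is deliberately not modelled, so the list of supported schemes remains the authority.

```python
def scheme_code(ruleset_name: str) -> str:
    """Abbreviate a ruleset name to the first letter of each word, lower case.

    Assumption: words are separated by '-' and the name may carry a
    leading '%'. Exceptional codes (used to avoid clashes) are not handled.
    """
    words = ruleset_name.lstrip("%").split("-")
    return "".join(word[0] for word in words if word).lower()


def tagged_language(language: str, ruleset_name: str) -> str:
    """Append the private tag to a language tag in the form -x-code."""
    return f"{language}-x-{scheme_code(ruleset_name)}"


print(tagged_language("en-US", "%spellout-ordinal-verbose"))  # en-US-x-sov
```

The resulting string is what would be passed as the language argument, as in the table rows above.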

In the absence of such a private tag, the following strategies are adopted:

Cardinal spellout: When directed to generate a cardinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-cardinal-verbose, spellout-cardinal, spellout-cardinal-native, spellout-cardinal-neuter, spellout-cardinal-feminine, spellout-cardinal-masculine. It appears that within ICU all languages contain at least one of these schemes, but if not, any scheme whose name matches the regular expression ^%spellout-cardinal is used (choosing the first provided for the locale).
Ordinal spellout: When directed to generate an ordinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-ordinal-verbose, spellout-ordinal, spellout-ordinal-native, spellout-ordinal-neuter, spellout-ordinal-feminine, spellout-ordinal-masculine. In the absence of any of these schemes, any scheme whose name matches the regular expression ^%spellout-ordinal is used (choosing the first provided for the locale). In the case of there being no ordinal scheme available for the locale (or a language that does not have ordinals) the default cardinal scheme is used.
Ordinal digits: When producing an ordinal digit suffix (e.g. 13º in Spanish), the 'digit-ordinal' ruleset is used by default. For languages that have specialist forms (e.g. Catalan and Spanish), the private tag must be set to get the specialist behaviour.
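The fallback order for spelled-out numbering can be sketched as follows. This is a Python illustration of the strategy described above, under the assumptions that ruleset names carry a leading '%' and that `available` lists a locale's rulesets in the order the locale provides them; the function names are hypothetical and this is not Saxon's implementation.

```python
import re

# Preference order of scheme-name endings, as listed in the text above.
SUFFIX_ORDER = ["-verbose", "", "-native", "-neuter", "-feminine", "-masculine"]


def pick_scheme(available, kind):
    """Pick a default scheme of the given kind ('cardinal' or 'ordinal')."""
    # 1. First scheme from the preference list that the locale provides.
    for suffix in SUFFIX_ORDER:
        name = f"%spellout-{kind}{suffix}"
        if name in available:
            return name
    # 2. Otherwise, the first scheme whose name matches ^%spellout-<kind>.
    pattern = re.compile(rf"^%spellout-{kind}")
    for name in available:
        if pattern.search(name):
            return name
    return None


def default_cardinal(available):
    return pick_scheme(available, "cardinal")


def default_ordinal(available):
    # With no ordinal scheme at all, fall back to the default cardinal scheme.
    return pick_scheme(available, "ordinal") or default_cardinal(available)
```

For a locale offering only `%spellout-cardinal-masculine` and `%spellout-cardinal-feminine`, this sketch selects the feminine scheme for both cardinal and ordinal requests, because feminine precedes masculine in the list and no ordinal scheme exists.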

English numbering

ICU renders 22 as words as "twenty-two", whereas the Saxon-HE default (not using ICU) is "twenty two", with a space rather than hyphen separator. At present the ICU result for all English schemes (cardinal and ordinal) is modified to use a space separator.

The default use of a '-verbose' scheme means that spellout of 118 yields "one hundred and eighteen", following British usage, rather than "one hundred eighteen", which is the shortened (US) form.

Both of these behaviours may eventually be configurable...

Using format-picture options within spelled-out numbering

format-number() and format-integer() can support further implementation-dependent control of numbering through parameters attached to the cardinal (c) and ordinal (o) modifiers, e.g. format-integer(1,'Ww;o(-er)','de'), which under the recommended approach is intended to produce "Erster".

In Saxon-PE/EE, there are two forms of such parameterisation supported for spelled-out (i.e. W|w) formats:

-suffix. This requests a numbering of the given sort (cardinal or ordinal) which ends in the suffix given, if one is available for the locale in the implementation.
2=example. This requests a numbering of the given sort (cardinal or ordinal) using the numbering scheme for which 2 would be formatted as example (ignoring case), if one is available for the locale in the implementation. For example, w;o(2=Secondo) in Italian should yield "centoventitreesimo" for 123, whereas with w;o(2=Seconda) the result is "centoventitreesima".
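The 2=example form amounts to a case-insensitive lookup over the way each candidate scheme spells 2. The Python sketch below illustrates that idea only. The scheme labels and the table of Italian renderings are hypothetical sample data, not values read from ICU or Saxon.

```python
# Hypothetical sample data: how each candidate ordinal scheme spells 2.
ITALIAN_ORDINAL_TWO = {
    "ordinal-masculine": "secondo",
    "ordinal-feminine": "seconda",
}


def select_scheme_by_example(parameter, renderings_of_two):
    """Return the scheme whose rendering of 2 equals the example, ignoring case.

    `parameter` is the text inside the parentheses of the modifier,
    e.g. "2=Seconda". Returns None when no scheme matches.
    """
    if not parameter.startswith("2="):
        return None
    example = parameter[2:].casefold()
    for scheme, word in renderings_of_two.items():
        if word.casefold() == example:
            return scheme
    return None


print(select_scheme_by_example("2=Seconda", ITALIAN_ORDINAL_TWO))  # ordinal-feminine
```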

For German (lang="de"), ICU does not provide case/gender-variable ordinal word numbering (i.e. only "Erste" and not "Erster"). The Saxon implementation supports the -suffix ordinal option described above, which replaces the trailing 'e' on the generated ordinal. Thus format-integer(1,'w;o(-en)','de') produces "ersten".
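The German suffix handling described above is a simple string substitution. A minimal Python sketch follows, assuming the generated ordinal ends in 'e' and the parameter is written with a leading hyphen. The function name is hypothetical and this is not Saxon's code.

```python
def apply_german_suffix(ordinal: str, parameter: str) -> str:
    """Replace the trailing 'e' of a generated German ordinal with the suffix.

    `parameter` is the modifier parameter, e.g. "-en". If the ordinal does
    not end in 'e' it is returned unchanged (an assumption of this sketch).
    """
    suffix = parameter.lstrip("-")
    if ordinal.endswith("e"):
        return ordinal[:-1] + suffix
    return ordinal


print(apply_german_suffix("erste", "-en"))  # ersten
```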

Dates

ICU also provides facilities for localized date formatting, principally for names of days of the week and months, though a variety of calendars and epoch naming facilities are also available. In Saxon-PE/EE, the naming of months and days (through picture fields on format-date()) is localised according to the language in scope. These names appear to be based entirely on the base language, with no regional variations.

As required by the specification, when the requested language is not implemented, the result of format-date() or format-dateTime() is prefixed with "[Language: default language code]".

Title Case

Conventions for title-casing sequences of words can differ markedly between languages, with many keeping strictly to lower case throughout, and a very few (such as Dutch with 'IJ') having highly specialised rules. When title case is requested (e.g. Ww in format-integer() or [MNn] in format-date()), the Saxon implementation follows these rules:

If the case requested is lower or upper, all words in the return from ICU are forced to the appropriate case.
If the case requested is title then the returned string from ICU is examined:
- If the first two characters are uppercase followed by lowercase, then the return is inferred already to be in title case and the result is returned unmodified.
- If not, then any letter preceded by a non-letter (or at start of string) is forced to uppercase, other letters are forced to lowercase.
- Some obvious 'joiner' words (e.g. 'and' in English, 'ën' in Dutch) that would otherwise be title-cased, are forced to lower-case.
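These rules can be sketched in Python as below. This is an illustration of the rules as listed, not Saxon's implementation: the function name, the `case` values and the way joiner words are supplied are all assumptions, and joiners are matched here only as whole space-separated words.

```python
def apply_case(text, case, joiners=frozenset()):
    """Apply lower, upper or title case to a string returned by ICU."""
    if case == "lower":
        return text.lower()
    if case == "upper":
        return text.upper()
    if case != "title":
        raise ValueError(f"unknown case: {case}")

    # Uppercase followed by lowercase: inferred to be title case already.
    if len(text) >= 2 and text[0].isupper() and text[1].islower():
        return text

    # Otherwise: uppercase any letter at the start or after a non-letter,
    # lowercase every other letter.
    out = []
    prev_is_letter = False
    for ch in text:
        if ch.isalpha():
            out.append(ch.lower() if prev_is_letter else ch.upper())
            prev_is_letter = True
        else:
            out.append(ch)
            prev_is_letter = False
    titled = "".join(out)

    # Force 'joiner' words back to lower case.
    words = titled.split(" ")
    return " ".join(w.lower() if w.lower() in joiners else w for w in words)


print(apply_case("two hundred and forty second", "title", {"and"}))
# Two Hundred and Forty Second
```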

Note that this approach may cause problems: for example, title case could be forced on a language that otherwise only ever uses a uniform case. If you discover issues in a language you are using, please let us know.