Numbers and Dates from ICU
International Components for Unicode (ICU) provides extensive facilities for localized numbering and date formatting, which are supported in Saxon-PE and Saxon-EE from version 9.6.
The ICU features require a sizeable library (about 7 MB), which may be supplied either in the main JAR file or as a separate JAR. The separate JAR can be either the 'minimised' version in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site.
If the ICU features have not been loaded within Saxon-PE/EE, support for numbering and dates for Danish, Dutch, Flemish, French (and Belgian French), German, Italian and Swedish is provided as detailed in the table of Numberings for selected languages.
Numbering
ICU supports a large set of language-specific rulesets for supporting different forms of spelled-out numbering and digit-ordinal treatment. For example:
Language | Number | Ruleset | XPath | Result |
en (English) | 1123 | %spellout-cardinal | format-integer(1123,'Ww','en-x-sc') | One Thousand One Hundred Twenty Three |
en-US (English, US) | 242 | %spellout-ordinal-verbose | format-integer(242,'Ww','en-US-x-sov') | Two Hundred and Forty Second |
cy (Welsh) | 132 | %spellout-cardinal-feminine | format-integer(132,'Ww','cy-x-scf') | Un Cant Tri Deg Dwy |
es-PT (Spanish, Portugal) | 242 | %spellout-ordinal-feminine | format-integer(242,'Ww','es-PT-x-sof') | Ducentésima Cuadragésima Segunda |
en (English) | 1123 | | format-integer(1123,'1;o','en') | 1123rd |
en-US (English, US) | 242 | | format-integer(242,'1;o','en-US') | 242nd |
cy (Welsh) | 132 | | format-integer(132,'1;o','cy') | 132. |
es-PT (Spanish, Portugal) | 242 | | format-integer(242,'1;o','es-PT') | 242.º |
In all cases the numbering scheme names and their behaviour appear to be taken from the base language, with no regional variation; for example, es-PT and es-419 would produce the same results.
To invoke one of these schemes, an IETF BCP 47 private-use tag is appended to the language tag, in the form -x-code. With a very small number of exceptions (to avoid clashes), each code is formed from the first letters of the words in the scheme name, all in lower case. Examples of use are shown in the preceding table. These private tag codes are recognised for 'word' numbering purposes in the language argument of both format-integer() and xsl:number.
A full list of the scheme codes and their support in a given language is given in Supported ICU Numbering Schemes.
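As a rough illustration of the naming rule, the following Python sketch derives a private tag code by taking the first letter of each word in a ruleset name. It is not Saxon's implementation: the function names `scheme_code` and `with_private_tag` are hypothetical, and the sketch ignores the small number of exceptions made to avoid clashes, so it will give the wrong code for any scheme whose code was adjusted.

```python
def scheme_code(ruleset_name):
    """Build a code from the first letter of each hyphen-separated word.

    Assumption: the ruleset name may carry ICU's leading '%', which is ignored.
    The clash-avoiding exceptions are NOT modelled here.
    """
    words = ruleset_name.lstrip("%").split("-")
    return "".join(word[0] for word in words if word).lower()


def with_private_tag(language, ruleset_name):
    """Append the private tag in the form -x-code to a language tag."""
    return f"{language}-x-{scheme_code(ruleset_name)}"


if __name__ == "__main__":
    for lang, ruleset in [
        ("en", "%spellout-cardinal"),
        ("en-US", "%spellout-ordinal-verbose"),
        ("cy", "%spellout-cardinal-feminine"),
    ]:
        print(with_private_tag(lang, ruleset))
```

For the rulesets used in the examples on this page, the derived codes (sc, sov, scf) agree with the tags shown there.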
In the absence of such a private tag, the following strategies are adopted:
- Cardinal spellout: When directed to generate a cardinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-cardinal-verbose, spellout-cardinal, spellout-cardinal-native, spellout-cardinal-neuter, spellout-cardinal-feminine, spellout-cardinal-masculine. It appears that within ICU all languages contain at least one of these schemes, but if not, any scheme whose name matches the regular expression ^%spellout-cardinal is used (choosing the first provided for the locale).
- Ordinal spellout: When directed to generate an ordinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-ordinal-verbose, spellout-ordinal, spellout-ordinal-native, spellout-ordinal-neuter, spellout-ordinal-feminine, spellout-ordinal-masculine. In the absence of any of these schemes, any scheme whose name matches the regular expression ^%spellout-ordinal is used (choosing the first provided for the locale). If no ordinal scheme is available for the locale (or the language does not have ordinals), the default cardinal scheme is used.
- Ordinal digits: When producing an ordinal digit suffix (e.g. 13.º in Spanish), the 'digit-ordinal' ruleset is used by default. For those cases where there are specialist forms (e.g. Catalan and Spanish), the private tag must be set to get the specialist behaviour.
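The selection order for spelled-out numbering can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Saxon's code: the function names are hypothetical, ruleset names are assumed to carry ICU's leading %, and the list of available rulesets is assumed to arrive in the order provided for the locale.

```python
import re

# Preference orders taken from the lists above (with ICU's leading '%' assumed).
CARDINAL_PREFERENCE = [
    "%spellout-cardinal-verbose", "%spellout-cardinal",
    "%spellout-cardinal-native", "%spellout-cardinal-neuter",
    "%spellout-cardinal-feminine", "%spellout-cardinal-masculine",
]
ORDINAL_PREFERENCE = [
    "%spellout-ordinal-verbose", "%spellout-ordinal",
    "%spellout-ordinal-native", "%spellout-ordinal-neuter",
    "%spellout-ordinal-feminine", "%spellout-ordinal-masculine",
]


def pick_scheme(available, preference, fallback_pattern):
    """Return the first preferred scheme, else the first regex match, else None."""
    for name in preference:
        if name in available:
            return name
    for name in available:  # order as provided for the locale
        if re.match(fallback_pattern, name):
            return name
    return None


def default_cardinal(available):
    return pick_scheme(available, CARDINAL_PREFERENCE, r"^%spellout-cardinal")


def default_ordinal(available):
    # With no ordinal scheme at all, fall back to the default cardinal scheme.
    chosen = pick_scheme(available, ORDINAL_PREFERENCE, r"^%spellout-ordinal")
    return chosen if chosen is not None else default_cardinal(available)


if __name__ == "__main__":
    # Illustrative ruleset list, not taken from any particular locale.
    rulesets = ["%spellout-numbering", "%spellout-cardinal", "%spellout-ordinal"]
    print(default_cardinal(rulesets), default_ordinal(rulesets))
```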
English numbering
ICU renders 22 in words as "twenty-two", whereas the Saxon-HE default (not using ICU) is "twenty two", with a space rather than a hyphen as separator. At present the ICU result for all English schemes (cardinal and ordinal) is modified to use a space separator.
The default use of a '-verbose' scheme means that spelling out 118 yields "one hundred and eighteen", following British usage, rather than "one hundred eighteen", which is the shortened (US) form.
Both of these behaviours may eventually be configurable.
Using format-picture options within spelled-out numbering
format-number() and format-integer() can support further implementation-dependent control of numbering through parameters attached to the cardinal (c) and ordinal (o) modifiers, e.g. format-integer(1,'Ww;o(-er)','de'), which is intended to produce "Erster" under the approach recommended in the specification.
In Saxon-PE/EE, two forms of such parameterisation are supported for spelled-out (i.e. W or w) formats:
- -suffix: This requests a numbering of the given sort (cardinal or ordinal) that ends in the given suffix, if one is available for the locale in the implementation.
- 2=example: This requests a numbering of the given sort (cardinal or ordinal) using the numbering scheme in which 2 would be formatted as example (ignoring case), if one is available for the locale in the implementation. For example, w;o(2=Secondo) in Italian should yield "centoventitreesimo" for 123, whereas with w;o(2=Seconda) the result is "centoventitreesima".
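To make the two parameter forms concrete, here is a small Python sketch that splits a picture such as 'Ww;o(-er)' into its primary format, its modifier letter, and the kind of parameter requested. It is an illustration only: the function name `parse_picture` and the tuple layout are hypothetical, and the pattern covers just the subset of picture syntax shown in this section (a primary format, then optionally ';' followed by c or o and a parenthesised parameter).

```python
import re

# Primary format, then optionally ';' + 'c' or 'o' + optional '(parameter)'.
_PICTURE = re.compile(r"(?P<primary>[^;]*)(?:;(?P<kind>[co])(?:\((?P<param>[^)]*)\))?)?")


def parse_picture(picture):
    """Return (primary, kind, request) for the subset of pictures handled here.

    request is None, ('suffix', text), ('example', number, word),
    or ('unrecognised', text).
    """
    match = _PICTURE.fullmatch(picture)
    if match is None:
        raise ValueError(f"picture not handled by this sketch: {picture!r}")
    param = match.group("param")
    if param is None:
        request = None
    elif param.startswith("-"):
        request = ("suffix", param[1:])           # the -suffix form
    elif "=" in param:
        number, word = param.split("=", 1)        # the 2=example form
        request = ("example", number, word)
    else:
        request = ("unrecognised", param)
    return match.group("primary"), match.group("kind"), request


if __name__ == "__main__":
    print(parse_picture("Ww;o(-er)"))
    print(parse_picture("w;o(2=Seconda)"))
```

A real processor would then use the request to choose among the rulesets available for the locale; that selection step is not modelled here.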
For German (lang="de"), ICU does not provide case/gender-variable ordinal word numbering (i.e. only "Erste" and not "Erster"). The Saxon implementation supports the -suffix ordinal option described above, which replaces the trailing 'e' of the generated ordinal. Thus format-integer(1,'w;o(-en)','de') produces "ersten".
Dates
ICU also provides facilities for localized date formatting, principally for names of days of the week and months, though a variety of calendars and epoch naming facilities are also available. In Saxon-PE/EE, the naming of months and days (through picture fields in format-date()) is localised according to the language in scope. These names all appear to be derived from the base language, with no regional variations.
As required by the specification, when the requested language is not implemented, the result of format-date() or format-dateTime() is prefixed with "[Language: default language code]".
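The prefixing behaviour can be pictured with a short Python sketch. Everything here is illustrative rather than Saxon's code: the function name, the set of supported languages, and the choice of "en" as the default language code are assumptions, and the caller is assumed to have already formatted the date in the default language.

```python
def mark_fallback_language(formatted, requested, supported, default_language="en"):
    """Prefix the output with a language marker when the request was not honoured.

    Assumptions: `supported` is a set of language codes the formatter implements,
    and `formatted` was produced in `default_language` when `requested` is absent.
    """
    if requested in supported:
        return formatted
    return f"[Language: {default_language}]{formatted}"


if __name__ == "__main__":
    supported = {"en", "de", "fr"}  # hypothetical set
    print(mark_fallback_language("Montag", "de", supported))
    print(mark_fallback_language("Monday", "xx", supported))
```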
Title Case
Conventions for title-casing sequences of words can differ markedly between languages, with many keeping strictly to lower case throughout, and a very few (such as Dutch with 'IJ') having very specialist rules. When title case is requested (e.g. Ww in format-integer() or [MNn] in format-date()), the Saxon implementation follows these rules:
- If the case requested is lower or upper, all words in the return from ICU are forced to the appropriate case.
- If the case requested is title, then the string returned from ICU is examined:
  - If the first two characters are uppercase followed by lowercase, then the return is inferred already to be in title case and the result is returned unmodified.
  - If not, then any letter preceded by a non-letter (or at start of string) is forced to uppercase, other letters are forced to lowercase.
  - Some obvious 'joiner' words (e.g. 'and' in English, 'ën' in Dutch) that would otherwise be title-cased, are forced to lower-case.
Note that this approach may cause problems: for example, title case could be forced on a language that otherwise only ever uses a uniform case. If you discover issues in a language you are using, please let us know.
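The case rules above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the Saxon implementation: the function names and the joiner list are hypothetical, "uppercase followed by lowercase" is read as "first character uppercase, second character lowercase", joiner words are assumed to be space-delimited, and joiners are only lowered when the string has been re-cased.

```python
def to_title_case(text, joiners=("and",)):
    """Title-case text following the rules described above (assumptions noted)."""
    # Already looks like title case: return unmodified.
    if len(text) >= 2 and text[0].isupper() and text[1].islower():
        return text
    out = []
    previous_is_letter = False
    for ch in text:
        if ch.isalpha():
            # A letter after a non-letter (or at the start) is raised, others lowered.
            out.append(ch.lower() if previous_is_letter else ch.upper())
            previous_is_letter = True
        else:
            out.append(ch)
            previous_is_letter = False
    # Force the (hypothetical) joiner words back to lower case.
    words = "".join(out).split(" ")
    return " ".join(w.lower() if w.lower() in joiners else w for w in words)


def apply_case(text, case):
    """Dispatch on the requested case: 'lower', 'upper' or 'title'."""
    if case == "lower":
        return text.lower()
    if case == "upper":
        return text.upper()
    if case == "title":
        return to_title_case(text)
    raise ValueError(f"unknown case: {case!r}")


if __name__ == "__main__":
    print(apply_case("one hundred and eighteen", "title"))
```

Because the first rule trusts any string that starts with an uppercase letter followed by a lowercase one, a string such as "Erste" passes through untouched, which is what lets a language's own capitalisation survive.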