Abstract
XSLT 2.0 provides a wide range of new features, many of which make light work of tasks that are notoriously difficult in XSLT 1.0, such as grouping and string manipulation. This paper attempts to show how these facilities not only make coding easier, but also extend the scope of the language, making it possible to tackle problems that were quite outside the range of XSLT 1.0.
The paper presents a case study of a multi-phase transformation taking data from a legacy ASCII-based interchange format to XML based on a standardized vocabulary. The transformations illustrate the power of new features including regular expression handling, grouping, recursive functions, and schema-aware processing.
The conclusion of the paper is that these new facilities - notably regular expression handling and grouping - take XSLT into new territory, where languages such as Perl previously reigned supreme. XSLT 1.0 works best where all the structure in a document is already identified by markup. XSLT 2.0 will also be able to handle many situations where the structure is implicit in the text, or in markup designed for presentation purposes rather than to capture the information semantics. It thus becomes a powerful tool for "up-conversion" applications. These facilities work well in conjunction with schema-aware processing, where the aim of the exercise is to create XML that conforms to a target schema.
XSLT 1.0 became a W3C Recommendation in November 1999; it has attracted at least twenty implementations and a very sizeable user base. It is used mainly for two distinct applications: rendering of XML documents by converting them into a presentation-oriented vocabulary (usually HTML, sometimes XSL-FO, XHTML, or SVG); and conversion of data-oriented XML messages, either into a different vocabulary, or to a different document using the same vocabulary, but with different information content. Within these two categories there are some highly creative and innovative applications, a notable example being Schematron, which uses XSLT transformations to apply structural and semantic validation rules to a document.
Although XSLT 1.0 is designed to transform source XML trees into result XML trees, it also includes three serialization methods, allowing the result tree to be output either as lexical XML, HTML, or text. This enables a wide range of applications in which the output is in textual form: I have seen XSLT stylesheets that generated Java programs, SQL code, comma-separated-values files, and EDI messages.
However, this ability to generate multiple output formats is not mirrored on the input side. XSLT 1.0 has very little capability to take anything other than XML as its input. There are ways around this: for example in the first edition of my book XSLT Programmer's Reference I showed how one could write a parser for a non-XML format such as the GEDCOM 5.5 format used for genealogical data, and by making this parser implement the SAX interface supported by many XSLT processors, one could present the parsed input data to the XSLT 1.0 processor as if it came from an XML parser. However, this is really only a minor improvement on what can be achieved by writing a GEDCOM-to-XML converter as a standalone application.
XSLT 2.0, as I will show in this paper, greatly extends the ability of XSLT to process any textual input, without the need to write conversion code in Java or another procedural programming language. It therefore enables XSLT to be used not only for XML-to-XML and XML-to-text applications, but also for text-to-XML conversions. More generally, it allows XSLT 2.0 to be used for up-conversion.
In the broadcasting industry, the term upconversion (usually without a hyphen) is used to mean the conversion of a low-frequency video format to an equivalent high-frequency format. In the SGML and XML world, the word refers to the generation of a format with detailed markup from a format with less-detailed or no markup, where it is necessary to generate the additional markup by recognizing structural patterns that are implicit in the textual content itself. By extension the term is also used for converting non-SGML or non-XML markup into SGML/XML: this usage is justified, of course, on the basis that SGML/XML is obviously on a higher plane than any alternative markup language!
I will start this paper with a survey of the new features in XSLT 2.0 that make it easier to write up-conversion transforms (it really doesn't make much sense to call them stylesheets any more, but I will slip into that usage occasionally). I will then present a case study of a particular up-conversion. I will use the example I mentioned earlier, conversion of GEDCOM genealogical data: but this time, the entire job will be done in XSLT 2.0, with no need to write preprocessing software in a procedural language.
In this section I will describe how four of the new features in XSLT 2.0 can be used to assist in writing up-conversion applications. The four features discussed are:
The unparsed-text() function
Regular expression processing
Grouping facilities
Schema-aware processing
The descriptions here are brief introductions to these facilities: for full information, see the W3C specifications of XSLT 2.0 [XSLT 2.0] and XPath 2.0 [XPath 2.0], or my books XSLT 2.0 Programmer's Reference [Kay, 2004a] and XPath 2.0 Programmer's Reference [Kay, 2004b].
In order to handle non-XML input, the first thing a stylesheet needs to be able to do is to read it. For this purpose, XSLT 2.0 provides the unparsed-text() function. This takes a URI as its first argument, and loads the text of the resource found at that URI. The result is a character string - that is, a value of type xs:string, where "xs" is the XML Schema namespace. (The type system of XSLT 2.0 is based on the types defined in the XML Schema specification.)
In fact, it was already possible in XSLT 1.0 to provide a stylesheet with non-XML input, in the form of a string-valued stylesheet parameter (parameters can be declared using a global <xsl:param> element). However, this imposes constraints: for example, it is difficult to handle a variable number of such inputs. Allowing URI-addressable resources to be accessed directly makes the job much easier.
Character encoding is of course a problem. The unparsed-text() function allows a second parameter to specify the character encoding explicitly, or it can be guessed from external information - the XSLT 2.0 spec refers to the algorithms and heuristics defined in the XLink specification for this purpose. In practice, if the file is an arbitrary file in operating system filestore with no associated metadata, guessing its encoding is sometimes going to give wrong answers. Sadly, there is no easy solution to this difficulty.
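As a minimal illustration of the two forms of the call (the file name here is hypothetical):

<!-- encoding left for the processor to infer from any external metadata -->
<xsl:variable name="text-a" select="unparsed-text('data.txt')"/>

<!-- encoding stated explicitly as the second argument -->
<xsl:variable name="text-b" select="unparsed-text('data.txt', 'iso-8859-1')"/>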
The fact that the result of the unparsed-text() function must be an xs:string imposes a constraint: the only characters allowed in the file are those permitted in XML documents. The same constraint also applies to any text output produced by a stylesheet. It means that XSLT is now capable of reading textual input and writing textual output, but it cannot be used to handle binary input or binary output, unless these are first translated into some textual representation.
XSLT 1.0 has been much criticized for its rather primitive text-handling capabilities: the function library provided for string handling in XPath 1.0 is designed very much on "reduced instruction set computing" principles - you can achieve pretty well anything, but the complexity of the programming needed even for some quite simple tasks can be daunting. In particular, for many users (whether or not they have a programming background), writing string manipulation routines in terms of recursive templates can present a big conceptual barrier.
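To make this concrete, here is the kind of recursive named template that XSLT 1.0 requires for even a trivial global substitution - a minimal sketch; the template and parameter names are arbitrary:

<!-- XSLT 1.0: replace every ';' in a string with ',' -->
<xsl:template name="replace-semicolons">
  <xsl:param name="s"/>
  <xsl:choose>
    <xsl:when test="contains($s, ';')">
      <xsl:value-of select="substring-before($s, ';')"/>
      <xsl:text>,</xsl:text>
      <!-- recurse on the remainder of the string -->
      <xsl:call-template name="replace-semicolons">
        <xsl:with-param name="s" select="substring-after($s, ';')"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$s"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

In XPath 2.0, as we shall see, the same job is a single call on the replace() function.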
I don't know the history of the decisions that brought this situation about. I have always thought the statement at the start of the XSLT 1.0 specification, to the effect that XSLT is not a general-purpose programming language, was very suggestive: committees don't put a statement like that in a specification unless there has been a vigorous debate on the matter, and the fact that the statement is there means there must have been a strong "keep it simple" camp on the working group who won the debate. Which is probably a good thing, given the length of time the world has been waiting for an XQuery recommendation.
But the fact is, there is a large class of applications for which the text processing capability in XSLT 1.0 is woefully inadequate - and this includes most up-conversion applications. XSLT 1.0 is very good at performing structural transformations - that is, at rearranging the nodes in a tree. It is much less good at manipulating the textual content of those nodes. By definition, up-conversion applications are those where the input doesn't have explicit structure, but rather has structure that is implicit in the text, and therefore they need good text processing capability.
Users of Perl and similar languages have long been accustomed to the power of regular expressions (regexes). In fact, they are so powerful they can become addictive: whereas programmers from other disciplines might turn to regular expressions as a last resort, there are Perl programmers who see almost any problem as an opportunity for creativity in their use of regexes.
XPath 2.0 offers three functions in its standard function library that perform regular expression processing, each illustrated in the short sketch after this list. Specifically:
matches(): returns a boolean value indicating whether a particular string matches a regular expression.
replace(): replaces those substrings within a given string that match a regular expression, with a replacement string.
tokenize(): breaks a string into a sequence of substrings, based on finding delimiters or separators that match a given regular expression.
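By way of orientation, here is one hypothetical use of each function; the variable $s and the patterns are purely illustrative:

<!-- matches(): is $s a run of digits? -->
<xsl:variable name="is-number" as="xs:boolean"
              select="matches($s, '^[0-9]+$')"/>

<!-- replace(): collapse each run of whitespace to a single space -->
<xsl:variable name="normalized" as="xs:string"
              select="replace($s, '\s+', ' ')"/>

<!-- tokenize(): split $s into fields at commas, with optional whitespace -->
<xsl:variable name="fields" as="xs:string*"
              select="tokenize($s, '\s*,\s*')"/>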
Conspicuously missing from this repertoire is any function that allows markup to be inserted into a string. It can be done, somewhat laboriously, by combining the different functions together, but using these three functions alone to translate See [2] into See <ref>2</ref> is painfully hard work. The reason for the omission is that it's hard to solve the requirement with a simple function.
The XSLT/XQuery/XPath programming model, despite the fact that it owes a great deal to functional programming theory, does not support higher-order functions. That is, functions are not first-class objects and cannot be supplied as arguments to other functions. This greatly limits the power of what can be achieved with a function library alone. All higher-order capabilities in the three languages are instead achieved by means of higher-order operators, custom syntax, or XSLT instructions. An example is the XPath for expression, which in a pure functional language would be expressed as a higher-order map or apply operator taking a sequence as its first argument and a function (to be applied to each member of the sequence) as its second argument; another example is the construct SEQ[P], which is essentially a higher-order filter function that takes a sequence as its first argument and a predicate as its second.
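For instance, the following XPath 2.0 expression uses for where a pure functional language would use a map operator (a trivial illustration):

for $i in 1 to 5 return $i * $i    (: yields the sequence 1, 4, 9, 16, 25 :)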
So the XSLT solution to this problem is an instruction, xsl:analyze-string, that logically takes four arguments: the string to be analyzed, a regex, an instruction to be executed to process substrings that match the regular expression, and an instruction to be executed to process substrings that don't match. The earlier example that turns See [2] into See <ref>2</ref> can then be coded as follows:
<xsl:analyze-string select="$input" regex="\[.*?\]">
  <xsl:matching-substring>
    <ref><xsl:value-of select="translate(., '[]', '')"/></ref>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
Those who are comfortable with regular expressions will have little difficulty following what regex="\[.*?\]" does: \[ matches an opening square bracket, .* matches any sequence of characters, the ? is a modifier indicating that the .* should match the shortest possible sequence of characters consistent with the regex as a whole succeeding, and the \] matches a closing square bracket.
The semantics of xsl:analyze-string are that the input string is scanned from left to right looking for substrings that match the regex. Substrings that don't match the regex are passed (as the context item, ".") to the xsl:non-matching-substring instruction, which in this case copies them unchanged, while substrings that do match the regex are passed to xsl:matching-substring, which in this example wraps the substring in a ref element, using the (XSLT 1.0) translate() function to drop the delimiting square brackets. (Regex devotees will find a different way of doing this, but the old translate() function suits me fine.)
There is no equivalent facility to xsl:analyze-string in XQuery. In the latest release (version 8.1) of Saxon I have introduced an extension to support higher-order functions, and have used this to provide an extension function saxon:analyze-string [Saxonica, 2004] that takes as its arguments the string to be processed, the regex, and two functions to be applied to the matching and non-matching substrings respectively. It's not quite as convenient to use as the XSLT 2.0 construct, but it demonstrates that if higher-order functions were available in the language, there would be a lot less need for custom syntax to solve such problems.
Grouping problems probably form the largest category of tricky-to-solve problems faced by XSLT 1.0 users. I classify any problem as a grouping problem if it requires the addition of an extra layer of hierarchy in the result tree that is not present in the source tree. Grouping problems fall essentially into two categories: those that group elements having matching data values, and those that group elements based on their position in a sequence (for example, a heading element followed by all the para elements up to the next heading).
XSLT 1.0 offers no inbuilt support for solving grouping problems, and neither does XQuery 1.0. The standard solution for value-based grouping in XSLT 1.0 is a technique using keys, which was invented by Steve Muench of Oracle and is therefore known as Muenchian grouping: its best description is that by Jeni Tennison at [Tennison]. (Steve never published it himself: he first described it in a personal email to me, and I announced his discovery to the world. I am very pleased that he got the credit he deserved, which is unusual in our industry.) For positional grouping, a number of techniques are possible, generally involving recursive processing using the following-sibling axis. (Unfortunately neither keys nor the following-sibling axis are available in XQuery, so XQuery users are going to struggle with this one.)
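For comparison with the XSLT 2.0 solution shown below, Muenchian grouping in XSLT 1.0 looks roughly like this - a minimal sketch, using the same author/affiliation structure as the example that follows:

<!-- XSLT 1.0: a key indexing each author by its affiliation value -->
<xsl:key name="by-affiliation" match="author" use="affiliation"/>

<!-- select one author per distinct affiliation, then fetch its whole group -->
<xsl:for-each select="author[generate-id() =
                             generate-id(key('by-affiliation', affiliation)[1])]">
  <affiliation name="{affiliation}">
    <xsl:copy-of select="key('by-affiliation', affiliation)"/>
  </affiliation>
</xsl:for-each>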
XSLT 2.0 offers a new instruction, xsl:for-each-group, to perform grouping. It provides four ways to define the grouping criterion: simple value-based grouping (the most common requirement) can be achieved by defining an expression to compute the grouping key, while the other three options define various kinds of positional grouping criteria. The body of the xsl:for-each-group instruction is then executed once for each group of nodes identified.
To take a simple example, the following code takes a flat list of author elements, and groups them so that authors with the same affiliation appear as children of an affiliation element:
<xsl:for-each-group select="author" group-by="affiliation">
  <affiliation name="{current-grouping-key()}">
    <xsl:copy-of select="current-group()"/>
  </affiliation>
</xsl:for-each-group>
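The positional variant mentioned earlier - a heading element followed by all the para elements up to the next heading - is handled by the same instruction; a minimal sketch, in which the section wrapper element is an assumed output name:

<xsl:for-each-group select="*" group-starting-with="heading">
  <!-- each group is a heading plus the paras before the next heading -->
  <section>
    <xsl:copy-of select="current-group()"/>
  </section>
</xsl:for-each-group>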
What is the relevance of this to up-conversion, the subject of this paper? The answer is that up-conversion involves detection of implicit structure, and replacement of the implicit structure by explicit markup. This is exactly what the grouping facilities do. This time, however, the implicit structure is found not by parsing the text, but by looking for patterns in the existing markup. This will become very clear in my case study, presented in the second half of this paper.
Like xsl:analyze-string, the xsl:for-each-group instruction is essentially syntactic sugar for a higher-order function. This time you can think of it (specifically the variant for value-based grouping) as a function whose arguments are the sequence to be grouped, a function to calculate the grouping key, and a function to be evaluated once for each group of items in the input sequence. So that XQuery users can take advantage of the grouping facilities in Saxon, I have again provided a higher-order extension function in Saxon 8.1 that provides this capability: its name is saxon:for-each-group() [Saxonica, 2004]. As with analyze-string, it is slightly clumsier to use than the custom syntax provided in XSLT 2.0, but again shows how much more power there would be in the language if higher-order functions were a standard feature.
The most radical difference between XSLT 2.0 and XSLT 1.0 is that the language has become strongly typed, with a type system based on XML Schema. This has been done in such a way that untyped (schemaless) processing is still possible as a fallback. There are many reasons this change has taken place, and much debate about the desirability of making such a radical change, especially in view of the fact that XML Schema is widely criticized both for its complexity and for the limitations in its capability. I would like to concentrate here, however, on its impact for writing up-conversion applications.
Since up-conversion often starts with an input file that is not XML, it is unlikely that an XML Schema will exist to describe its structure. Fortunately this is not a problem: XSLT is still perfectly happy to work with untyped, schemaless data.
I have often found that it is best to structure an up-conversion as a sequence of two (maybe more) transformations. The first transformation takes the raw input data in whatever legacy format it arrives in, and translates it to an XML representation that is as close to the original structure as possible, consistent with it being XML. The second transformation takes this raw XML and translates it to the desired target XML vocabulary.
The target vocabulary typically represents XML that is designed to have significant visibility: it may be long-lived, widely-shared, or both. Therefore, it is very likely that there will exist an XML Schema for this vocabulary. The schema-aware capabilities of XSLT that are relevant to up-conversion therefore tend to be those that are concerned with validating the result tree, rather than those concerned with processing the source. In the case study I will show how this validation assisted with the development process for creating correct XSLT transformations. The case study in this paper is an artificial one, constructed largely for pedagogic purposes, but I have had the same experiences in a real project involving the capture of human resources data from Excel spreadsheets for transfer into an XML database.
In the second part of this paper we will look at how the constructs introduced in the previous section are used in a practical example of an up-conversion exercise.
Genealogical data is interesting for a number of reasons. Genealogy is one of the most popular applications of the web for millions of people, and its success relies on the ability to exchange data between different application packages. The data itself is quite complex, for two reasons: the variety of information that people want to record, and the need to capture uncertain information and conflicting versions of events. For many years genealogical data has been exchanged using a format called GEDCOM [LDS, 1996], devised by the Church of Jesus Christ of Latter-Day Saints (the Mormons). GEDCOM 5.5 uses a hierarchic record format rather in the style of a COBOL data definition, typified by the following entry:
0 @I53@ INDI
1 NAME Michael Howard /KAY/
1 SEX M
1 BIRT
2 DATE 11 OCT 1951
2 PLAC Hannover, Germany
3 MAP
4 LATI N52
4 LONG E9
1 OCCU Software Designer
2 DATE FROM 1975 TO 2004
1 EDUC Postgraduate
2 DATE FROM 1969 TO 1975
2 PLAC Cambridge, England
3 MAP
4 LATI N52
4 LONG E0
2 NOTE PhD in Computer Science
1 FAMS @F233@
1 FAMC @F221@
The @I53@ field is a record identifier, and the values @F233@ and @F221@ are pointers to other records (specifically, the record describing the family in which this individual is a parent, and the record describing the family in which this individual is a child).
This can of course be directly translated to an XML syntax, such as this:
<INDI>
  <NAME>Michael Howard /KAY/</NAME>
  <SEX>M</SEX>
  <BIRT>
    <DATE>11 OCT 1951</DATE>
    <PLAC>Hannover, Germany
      <MAP>
        <LATI>N52</LATI>
        <LONG>E9</LONG>
      </MAP>
    </PLAC>
  </BIRT>
  <OCCU>Software Designer
    <DATE>FROM 1975 TO 2004</DATE>
  </OCCU>
  <EDUC>Postgraduate
    <DATE>FROM 1969 TO 1975</DATE>
    <PLAC>Cambridge, England
      <MAP>
        <LATI>N52</LATI>
        <LONG>E0</LONG>
      </MAP>
    </PLAC>
    <NOTE>PhD in Computer Science</NOTE>
  </EDUC>
  <FAMS REF="F233"/>
  <FAMC REF="F221"/>
</INDI>
The first stage of our up-conversion application will be to convert the data into this form. After that we will see how to convert it further to the actual target XML vocabulary defined by the proposed GEDCOM-XML standard.
In my book XSLT Programmer's Reference (including the latest edition for XSLT 2.0) I describe how to perform this step by writing a GEDCOM parser in Java. The fact is, however, that it can be coded entirely in XSLT 2.0, and that the XSLT 2.0 code is actually shorter than the Java implementation. Let's see what it looks like.
First we have to read the input file, which we can do like this:
<xsl:param name="input" as="xs:string" required="yes"/>

<xsl:variable name="input-text" as="xs:string"
              select="unparsed-text($input, 'iso-8859-1')"/>
(I've actually cheated here. GEDCOM requires files to be encoded in a character set called ANSEL, otherwise known as ANSI Z39.47-1985, which is used for almost no other purpose. If ANSEL were a mainstream character encoding, it could be specified in the second argument of the unparsed-text() function call. In practice, however, it is rather unlikely that any XSLT 2.0 processor would support this encoding natively. Therefore, the conversion from ANSEL to a mainstream character encoding will still have to be done in a pre-processing phase.)
The next stage is to split the input into lines, which can be done using the XPath 2.0 tokenize() function. Since the unparsed-text() function does not normalize line endings (this might yet change), the regular expression for matching the separator between tokens accepts both UNIX and Windows line endings. The result is a sequence of strings, one for each line of the input file:
<xsl:variable name="lines" as="xs:string*" select="tokenize($input-text, '\r?\n')"/>
Now we need to parse the individual lines. Each line in a GEDCOM file has up to five fields: a level number, an identifier, a tag, a cross-reference, and a value. We will create an XML line element representing the contents of the line, using attributes to represent each of these five components:
<xsl:variable name="parsed-lines as="element(line)*"> <xsl:for-each select="$lines"> <xsl:analyze-string select="." flags="x" regex="^([0-9]+)\s* (@([A-Za-z0-9]+)@)?\s* ([A-Za-z]*)?\s* (@([A-Za-z0-9]+)@)? (.*)$"> <xsl:matching-substring> <line level="{regex-group(1)}" ID="{regex-group(3)}" tag="{regex-group(4)}" REF="{regex-group(6)}" text="{regex-group(7)}"/> </xsl:matching-substring> <xsl:non-matching-substring> <xsl:message> Non-matching line "<xsl:value-of select="."/>" </xsl:message> </xsl:non-matching-substring> </xsl:analyze-string> </xsl:for-each> </xsl:variable>
Note first the as attribute on the xsl:variable declaration. I have consistently been declaring the types of my variables: this helps to pick up programming errors and it documents the stylesheet for the reader. I can do this even with a non-schema-aware stylesheet: the form element(line)* indicates that the variable holds a sequence of elements whose name is line. I could further constrain them to conform to a line element declaration in an XML schema by writing schema-element(line)*, but I've chosen not to do that here, because it's too much effort to create a schema to describe this transient data structure.
The actual content of the elements is constructed by analyzing the text of the input GEDCOM line using a regular expression. The attribute flags="x" allows the regex to be split into multiple lines for readability. The five lines of the regex correspond to the five fields that may be present. I describe this usage of xsl:analyze-string as a "single-match" usage, because the idea is that the regular expression matches the entire input string exactly once, and the xsl:non-matching-substring instruction is used only to catch errors. Within the xsl:matching-substring instruction, the content of the line is picked apart using the regex-group() function, which returns the part of the matching substring that matched the n'th parenthesized subexpression within the regex. If the relevant part of the regex wasn't matched (for example, if the optional identifier was absent) then this returns a zero-length string, and our XSLT code then creates a zero-length attribute.
So we now have a sequence of XML elements each representing one line of the GEDCOM file, each containing attributes to represent the contents of the five fields in the input. The next stage is to convert this flat sequence into a hierarchy, in which level 2 lines (for example) turn into XML elements that contain the corresponding level 3 lines.
Any problem that involves adding hierarchic levels to the result tree that were not present in the source tree can be regarded as a grouping problem, and it should therefore be no surprise that we tackle it using the xsl:for-each-group instruction. This time a group consists of a level N element together with the following elements up to the next one at level N. So this is a positional grouping rather than a value-based grouping. The option that we use to tackle this is the group-starting-with attribute, whose value is a match pattern that is used to recognize the first element in each group.
A single application of xsl:for-each-group creates one extra level in the result tree. In this example, we have a variable number of levels, so we want to apply the instruction a variable number of times. First we group the overall sequence of line elements so that each level 0 line starts a new group. Within this group, we perform a further grouping so that each level 1 line starts a new group, and so on up to the maximum depth of the hierarchy. As one might expect, the process is recursive: we write a recursive template that performs the grouping at level N, and that calls itself to perform the level N+1 grouping. This is what it looks like:
<xsl:template name="process-level"> <xsl:param name="population" required="yes" as="element()*"/> <xsl:param name="level" required="yes" as="xs:integer"/> <xsl:for-each-group select="$population" group-starting-with="*[xs:integer(@level) eq $level]"> <xsl:element name="{@tag}"> <xsl:copy-of select="@ID[string(.)], @REF[string(.)]"/> <xsl:value-of select="normalize-space(@text)"/> <xsl:call-template name="process-level"> <xsl:with-param name="population" select="current-group()[position() != 1]"/> <xsl:with-param name="level" select="$level + 1"/> </xsl:call-template> </xsl:element> </xsl:for-each-group> </xsl:template>
When this is called to process all the line elements with the $level parameter set to zero, it forms one group for each line having the attribute level="0", containing that line and all the following lines up to the next one with level="0". It then processes each of these groups by creating an element to represent the level 0 line (the name of this element is taken from the GEDCOM tag, and its ID and REF attributes are copied unless they are empty), and constructs the content of this new element by means of a recursive call, processing all elements in the group except the first, and looking this time for level 1 lines as the ones that start a new group. The process continues until there are no lines at the next level (the xsl:for-each-group instruction does nothing if the population to be grouped is empty).
The remaining code in the stylesheet simply invokes this recursive template to process all the lines at level 0:
<xsl:template name="main"> <xsl:call-template name="process-level"> <xsl:with-param name="population" select="$parsed-lines/ged/line"/> <xsl:with-param name="level" select="0"/> </xsl:call-template> </xsl:template>
This main template represents the entry point to the stylesheet. There is no match="/" template rule, because there is no source XML document with a root node to be matched; instead, XSLT 2.0 allows a transformation to be invoked by specifying the name of a named template where execution is to start. I use the name main as a matter of convention.
We have now converted the GEDCOM data to XML. The next step is to convert it to the actual XML vocabulary that the target application requires.
Like many up-conversion problems, the GEDCOM problem is best solved in two stages: the first stage is essentially a syntactic transformation of the raw data into XML, and the second stage is a semantic transformation to a different data model.
At the same time as moving to XML, the GEDCOM designers decided it was time to fix some long-standing deficiencies in the data model. The draft GEDCOM 6.0 specification [LDS, 2002] therefore not only moves from ANSEL character encoding to Unicode, and from COBOL-like level numbers to nested XML tags, it also changes the structure of the data. Events, for example, are now primary objects in their own right, rather than being always subsidiary to an individual or family. This reflects the fact that there is often uncertainty as to whether two events involve the same individual (rather than two distinct individuals having the same name), and it also makes it easier to record all the individuals associated with an event - for example, the witnesses at a marriage, or the godparents at a christening.
The transformation of GEDCOM 5.5 files to "raw XML", as described in the previous section, is therefore followed by a second transformation, this time to XML that conforms to the target schema defined by GEDCOM 6.0. (I'm taking it as read here that GEDCOM 6.0 exists and is stable and is worth adopting as a target. This idealizes the actual state of affairs, but the debate isn't relevant to this paper.)
Multi-phase transformations can be done in either of two ways: using a single stylesheet (typically using different modes for the two phases) or using one stylesheet for each phase. I usually find it is easier to develop them using multiple stylesheets, and then integrate them together later as a production application.
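The single-stylesheet approach might be sketched as follows - a minimal outline, assuming phase one builds the intermediate tree in a variable and phase two processes it in its own mode (all names here are illustrative):

<xsl:template name="main">
  <!-- Phase 1: build the intermediate "raw XML" as a temporary tree -->
  <xsl:variable name="raw-xml" as="element()*">
    <xsl:call-template name="phase-1"/>
  </xsl:variable>
  <!-- Phase 2: transform the temporary tree using a dedicated mode -->
  <xsl:apply-templates select="$raw-xml" mode="phase-2"/>
</xsl:template>

(Unlike XSLT 1.0, XSLT 2.0 allows a temporary tree held in a variable to be processed directly, so no node-set() extension is needed.)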
The second transformation is rather more conventional than the first, because it starts with XML as its input. I've presented the full stylesheet in XSLT 2.0 Programmer's Reference, and I won't repeat it here in full. What I would like to draw out, however, is the impact of making this stylesheet schema-aware.
The first stylesheet, presented in the previous section, didn't use an XML schema. The input isn't XML, so it clearly has no schema; and the output uses a local transient XML vocabulary where the effort of writing a schema probably isn't worthwhile. However, for the second stylesheet, the aim is to produce output that conforms to a recognized standard XML vocabulary, for which an XML Schema exists, and we clearly want to have as much confidence as we can that the stylesheet output will always conform to this target schema.
With XSLT 1.0, the way you achieve this is to run your stylesheet against as many test cases as you can, and validate the output of each test case against the target schema. If validation errors are reported, you then have to debug the stylesheet to find out why it produced incorrect output in this particular case.
It would be far better if one could determine statically, purely from examination of the stylesheet, that its output will be correct. In practice this is unlikely to be fully achievable, because of the highly dynamic nature of XSLT template rules. However, there are many errors that could in principle be detected statically, and each error that is found this way makes a significant contribution to easing the testing and debugging burden. For example, here is an extract of the second-phase GEDCOM stylesheet:
<xsl:result-document validation="strict">
  <GEDCOM>
    <HeaderRec>
      <FileCreation Date="{format-date(current-date(),
                           '[D1] [MN,*-3] [Y0001]')}"/>
      <Submitter>
        <Link Target="ContactRec" Ref="Contact-Submitter"/>
      </Submitter>
    </HeaderRec>
    <xsl:call-template name="families"/>
    <xsl:call-template name="individuals"/>
    <xsl:call-template name="events"/>
    <ContactRec Id="Contact-Submitter">
      <Name><xsl:value-of select="$submitter"/></Name>
    </ContactRec>
  </GEDCOM>
</xsl:result-document>
One can see many potential errors that could be detected statically by the stylesheet compiler. It can check that there is a schema definition of the GEDCOM element, and that HeaderRec and ContactRec are permitted respectively as the first and last child elements of the GEDCOM element. It can check similarly that the elements within the HeaderRec are allowed to appear where they do, that they are allowed to have the appropriate attributes, and that none of these elements have required attributes which the stylesheet does not generate. In some cases the compiler can also check that the textual content of elements and attributes is appropriate to their type. The analysis can extend beyond the fragment shown here to the three named templates invoked by this fragment; for example, if the call on the individuals template preceded that on the families template, then the compiler could deduce that the stylesheet was outputting IndividualRec elements ahead of FamilyRec elements, which the schema does not allow.
As programmers, we are all familiar with the fact that errors detected at compile-time are much quicker to find and to fix than errors detected at run-time. This is as true for XSLT as for any other programming language.
Currently the only schema-aware XSLT processor available is my own Saxon product, and the current release (8.0) does not yet do the kind of static checking described above. Even run-time checking, however, can pay substantial dividends. For example, one error that I made during development was to write an attribute of a literal result element as id="@ID" instead of id="{@ID}". Ordinarily, this would cause the result document to contain the attribute value id="@ID". When the programmer gets round to validating the output (a stage which is often omitted during development and testing) this would reveal an error, because the id attribute is declared as having type xs:ID, and an @ character is not allowed in values of this type. Running with a schema-aware processor, this error was reported as soon as the offending code in the stylesheet was executed, with the incorrect line in the stylesheet being accurately pinpointed.
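For the record, the difference between the two forms is as follows (the element name used here is illustrative):

<!-- wrong: the output attribute holds the literal string "@ID" -->
<IndividualRec Id="@ID"/>

<!-- right: the curly braces make this an attribute value template,
     so @ID is evaluated as an XPath expression against the context node -->
<IndividualRec Id="{@ID}"/>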
I actually found that while developing this and other similar stylesheets, the number of errors detected by validation of result trees was so large that it became a little frustrating. Sometimes one actually wants to develop a stylesheet "top-down", getting the broad structure of the output right first, and focusing on the detail later. As a response to this experience, Saxon 8.1 allows multiple validation errors in the output to be reported in a single run, and it allows you to see the (invalid) result tree that was generated, along with comments inserted into the XML showing where it is invalid and which stylesheet instructions need to be changed to fix the errors. This provides another of the benefits normally associated with compile-time errors, the ability to report many errors in a single run.
Like other new features in XSLT 2.0, such as xsl:analyze-string and xsl:for-each-group, the facility to validate result documents on-the-fly is useful for a wide range of applications, of which up-conversion applications are just one example. But taken together, these features make a dramatic difference to the ease of developing up-conversion applications when compared with XSLT 1.0.
The first part of this paper described four specific features of XSLT 2.0 that make it highly suitable for writing up-conversion applications, namely:
The unparsed-text() function
Regular expression processing
Grouping facilities
Schema-aware processing
The second half of the paper showed how these features can be used in a practical up-conversion exercise, the translation of GEDCOM 5.5 genealogical data to the proposed GEDCOM 6.0 XML vocabulary.
XSLT 1.0 has been widely deployed to achieve both XML-to-XML and XML-to-text transformations. The conclusion of this paper is that XSLT 2.0 is also highly suited to a wide range of text-to-XML applications, thus greatly increasing the scope of applicability of the language.
[XSLT 2.0] XSL Transformations (XSLT) Version 2.0. W3C Working Draft 12 November 2003.
[XPath 2.0] XML Path Language (XPath) 2.0. W3C Working Draft 23 July 2004.
[Tennison] Jeni's XSLT Pages: Grouping. Jeni Tennison.
[LDS, 1996] The GEDCOM Standard, Release 5.5. Family History Department, The Church of Jesus Christ of Latter-day Saints. January 2, 1996.
[LDS, 2002] GEDCOM XML Specification, Release 6.0, Beta Version. Family and Church History Department, The Church of Jesus Christ of Latter-day Saints. December 6, 2002.
[Saxonica, 2004] Saxon 8.1 Documentation. Saxonica Limited, to be published. Go to www.saxonica.com, follow links to Documentation, then Extensions, then Extension Functions.