XML Parsing and Serialization

Parsing

If a SAXSource containing an XMLReader is supplied to Saxon, Saxon now respects the ErrorHandler associated with the XMLReader rather than replacing it with its own.
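As an illustration, the following minimal sketch uses only standard JAXP and SAX classes to build a SAXSource whose XMLReader carries an application-defined ErrorHandler. The class names ErrorHandlerSketch and CollectingErrorHandler and the method buildSource are hypothetical, introduced only for this example; nothing here shows Saxon internals.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical example class; not part of Saxon.
class ErrorHandlerSketch {

    /** Hypothetical handler that records parser reports instead of printing them. */
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            messages.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            messages.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXParseException {
            messages.add("fatal: " + e.getMessage());
            throw e; // a fatal error ends the parse
        }
    }

    /** Builds a SAXSource whose XMLReader already has the caller's ErrorHandler set. */
    static SAXSource buildSource(String xml, ErrorHandler handler) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setErrorHandler(handler); // the handler that should be respected
        return new SAXSource(reader, new InputSource(new StringReader(xml)));
    }
}
```

A source built this way can be passed to any JAXP consumer, for example Transformer.transform(source, result); when Saxon is the processor, parse errors should then be reported to the supplied handler.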

Serialization

Very basic support for HTML 5 has been added. If the serialization method is "html" and the version is "5.0", the declaration <!DOCTYPE HTML> is output regardless of the doctype-system and doctype-public properties.
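One way to request this combination from Java is through the standard JAXP output properties. The sketch below only assembles the two properties named above; the class Html5OutputSketch and its method are hypothetical names chosen for the example.

```java
import java.util.Properties;
import javax.xml.transform.OutputKeys;

// Hypothetical example class; not part of Saxon.
class Html5OutputSketch {

    /** Returns output properties selecting the "html" method at version "5.0". */
    static Properties html5OutputProperties() {
        Properties props = new Properties();
        props.setProperty(OutputKeys.METHOD, "html");
        props.setProperty(OutputKeys.VERSION, "5.0");
        return props;
    }
}
```

The properties can be applied with Transformer.setOutputProperties(Html5OutputSketch.html5OutputProperties()). Whether the HTML 5 DOCTYPE is actually emitted depends on the processor in use; the behavior described here applies to Saxon.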

A new serialization option, saxon:recognize-binary, has been added for use with the text output method only. If it is set to yes, the processing instructions <?hex XXXX?> and <?b64 XXXX?> are recognized: the value is taken as the hexBinary or base64 representation of a character string, encoded using the encoding in use by the serializer, and this character string is output without checking that it contains only valid XML characters. This enables non-XML characters, notably binary zero, to be output. For example, <?hex 0c?> outputs an ASCII form feed. Also recognized are <?hex.EEEE XXXX?> and <?b64.EEEE XXXX?>, where EEEE is the name of the encoding of the hexBinary or base64 data: for example, hex.ascii or b64.utf8.
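The decoding rule can be modelled in a few lines of Java. This is a sketch of the semantics as described above, not Saxon's implementation: the class RecognizeBinarySketch and its methods are hypothetical, and it assumes that the encoding name after the dot is one that Java's Charset.forName accepts.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical model of the described behavior; not Saxon code.
class RecognizeBinarySketch {

    /** Converts an even-length string of hex digits, such as "0c", into bytes. */
    static byte[] hexToBytes(String hex) {
        String digits = hex.trim();
        if (digits.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(digits.charAt(2 * i), 16);
            int lo = Character.digit(digits.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /**
     * Decodes the data of a processing instruction whose target is "hex", "b64",
     * "hex.EEEE" or "b64.EEEE". Without an explicit EEEE part, the serializer's
     * own encoding is used to turn the bytes into characters.
     */
    static String decode(String target, String data, String serializerEncoding) {
        String form = target;
        String encoding = serializerEncoding;
        int dot = target.indexOf('.');
        if (dot >= 0) {
            form = target.substring(0, dot);
            encoding = target.substring(dot + 1);
        }
        byte[] bytes;
        switch (form) {
            case "hex":
                bytes = hexToBytes(data);
                break;
            case "b64":
                bytes = Base64.getDecoder().decode(data.trim());
                break;
            default:
                throw new IllegalArgumentException("unrecognized target: " + target);
        }
        return new String(bytes, Charset.forName(encoding));
    }
}
```

Under this model, decode("hex", "0c", "UTF-8") yields a single form feed character, and decode("hex.ascii", "4869", "UTF-8") yields "Hi".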

A new UTF-8 writer, contributed by Tatu Saloranta, is used in place of the standard Java UTF-8 writer. This speeds up serialization by around 20%; for a transformation that copies its input to its output, the overall improvement is about 10%.