saxon:parse-html
Parses HTML supplied as a string.
parse-html($html as xs:string) ➔ document-node()
Arguments
| Name | Type | Description |
| $html | xs:string | The HTML content as a string |
Result
| document-node() |
Namespace
http://saxon.sf.net/
Saxon availability
Requires Saxon-PE or Saxon-EE.
Notes on the Saxon implementation
Available since Saxon 9.2. Reimplemented in Saxon 12.
Details
This function takes a single argument, a string containing the source text of an HTML
document. It returns the document node (root node) that results from parsing this text
using the validator.nu parser on Java and the AngleSharp parser on .NET.
On the Java platform, the validator.nu jar file must be on the classpath. It may be
downloaded from Maven.
For SaxonCS, the HTML is parsed using AngleSharp, which
is registered as a dependency and will normally be installed automatically by NuGet.
This function is useful where an HTML document is embedded inside another document using a CDATA section. It can also be used in conjunction with the unparsed-text() function to read HTML from the file system. Note that when the HTML is read this way, the base URI of the source file is not retained in the resulting document.
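As an illustration of the CDATA case, the following is a minimal sketch rather than a definitive usage pattern. It assumes a hypothetical source document in which a body element holds HTML markup inside a CDATA section, and it lists the href values of any links found in that markup. The element names body, links and link are assumptions made for the example. Because the namespace of the parsed elements is not stated here, the sketch matches a elements with the namespace wildcard *:a.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- 'body' is a hypothetical element whose text content is HTML held in a CDATA section -->
  <xsl:template match="body">
    <!-- string(.) returns the embedded markup as text; parse-html returns a document node -->
    <xsl:variable name="doc" select="saxon:parse-html(string(.))"/>
    <links>
      <!-- *:a matches 'a' elements in any namespace or in none (an assumption-free match) -->
      <xsl:for-each select="$doc//*:a/@href">
        <link><xsl:value-of select="."/></link>
      </xsl:for-each>
    </links>
  </xsl:template>

</xsl:stylesheet>
```

The same call can be applied to text read from a file, for example saxon:parse-html(unparsed-text('page.html')), where page.html is a hypothetical file name resolved relative to the stylesheet.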
Because different parsers are used, the results may differ slightly between platforms, but such differences generally occur only in edge cases, such as the use of element or attribute names containing characters that are not allowed in XML.