Collections
Saxon implements the collection() function by passing the given URI (or null, if the default collection is requested) to a user-provided CollectionURIResolver. This section describes how the standard collection resolver behaves if no user-written collection resolver is supplied.
The default collection resolver returns the empty sequence as the default collection. The only way of specifying a non-empty default collection is to provide your own CollectionURIResolver.
If a collection URI is provided, Saxon attempts to dereference it. What happens next depends on whether the URI identifies a file or a directory.
Defining a collection using a catalog file
If the collection URI identifies a file, Saxon treats this as a catalog file. This is a file in XML format that lists the documents comprising the collection. Here is an example of such a catalog file:
    <collection stable="true">
      <doc href="dir/chap1.xml"/>
      <doc href="dir/chap2.xml"/>
      <doc href="dir/chap3.xml"/>
      <doc href="dir/chap4.xml"/>
    </collection>

The stable attribute indicates whether the collection is stable or not. The default value is true. If a collection is stable, then the URIs listed in the doc elements are treated like URIs passed to the doc() function.
Each URI is first looked up in the document pool to see if it is already loaded; if it is, then the document node is returned. Otherwise the URI is passed to the registered URIResolver, and the resulting document is added to the document pool. The effect of this process is, firstly, that two calls on the collection() function passing the same collection URI will return the same nodes each time, and, secondly, that these results are consistent with the results of the doc() function: if the document-uri() of a node returned by the collection() function is passed to the doc() function, the original node will be returned. If stable="false" is specified, however, the URI is dereferenced directly, and the document is not added to the document pool, which means that a subsequent retrieval of the same document will not return the same node.
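The difference between stable and non-stable retrieval can be illustrated with a small model. The following is only a sketch, not Saxon's implementation: the class DocumentPoolSketch, its nested Doc class, and the loader function are hypothetical names invented for this illustration, and object identity (==) stands in for node identity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of the behaviour described above; not part of the Saxon API.
final class DocumentPoolSketch {

    // Stand-in for a document node. Object identity models node identity.
    static final class Doc {
        final String uri;
        Doc(String uri) { this.uri = uri; }
    }

    private final Map<String, Doc> pool = new HashMap<>();
    private final Function<String, Doc> loader; // plays the role of the URIResolver

    DocumentPoolSketch(Function<String, Doc> loader) {
        this.loader = loader;
    }

    Doc fetch(String uri, boolean stable) {
        if (!stable) {
            // stable="false": dereference directly, bypassing the pool entirely
            return loader.apply(uri);
        }
        // stable="true": reuse the pooled document, loading and pooling it on first use
        Doc doc = pool.get(uri);
        if (doc == null) {
            doc = loader.apply(uri);
            pool.put(uri, doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        DocumentPoolSketch sketch = new DocumentPoolSketch(Doc::new);
        Doc first = sketch.fetch("dir/chap1.xml", true);
        System.out.println(first == sketch.fetch("dir/chap1.xml", true));   // true
        System.out.println(first == sketch.fetch("dir/chap1.xml", false));  // false
    }
}
```

In this model, two stable fetches of the same URI yield the same object, while every non-stable fetch yields a fresh one.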
Processing directories
If the URI passed to the collection() function (still assuming a default CollectionURIResolver) identifies a directory, then the contents of the directory are returned. Such a URI may have a number of query parameters, written in the form:

    file:///a/b/c/d?keyword=value;keyword=value;...

The recognized keywords and their values are as follows:
keyword | values | effect
recurse | yes / no (default no) | Determines whether subdirectories are searched recursively.
strip-space | yes / ignorable / no | Determines whether whitespace text nodes are to be stripped. The default depends on the
validation | strip / preserve / lax / strict | Determines whether and how schema validation is applied to each document. The default depends on the
select | file name pattern | Determines which files are selected (see below).
on-error | fail / warning / ignore | Determines the action to be taken if one of the documents cannot be successfully parsed.
parser | Java class name | Class name of the Java
xinclude | yes / no | Determines whether XInclude processing should be applied to the selected documents. This overrides any setting in the
unparsed | yes / no (default no) | Determines whether the files contain unparsed text. If
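As an illustration of the query syntax, a directory collection URI can be assembled from keyword/value pairs joined with semicolons. This is a minimal sketch: the class CollectionUriBuilder and its build method are hypothetical helpers written for this example, not part of Saxon, and the values are assumed to need no escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for this illustration; not part of the Saxon API.
final class CollectionUriBuilder {

    // Appends the parameters as ?keyword=value;keyword=value;... in insertion order.
    static String build(String directoryUri, Map<String, String> params) {
        if (params.isEmpty()) {
            return directoryUri;
        }
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
        return directoryUri + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("recurse", "yes");
        params.put("select", "*.xml");
        params.put("on-error", "warning");
        // prints file:///a/b/c/d?recurse=yes;select=*.xml;on-error=warning
        System.out.println(build("file:///a/b/c/d", params));
    }
}
```

The resulting string would then be passed as the argument of the collection() function.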
The pattern used in the select parameter can use glob-like syntax; for example, *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", and replacing "*" by ".*", and it is then used to match the file names appearing in the directory using the Java regular expression rules. So, for example, you can write ?select=*.(xml|xhtml) to match files with either of these two file extensions.
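The conversion described above can be sketched in a few lines of Java. This follows the four steps as stated in the text and is not taken from the Saxon source; the class SelectPattern and its methods are hypothetical names used only for this illustration. Note that dots are escaped before asterisks are expanded, so that the "." introduced by ".*" is not itself escaped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for this illustration; not part of the Saxon API.
final class SelectPattern {

    // Converts a select pattern to a regular expression using the steps in the text.
    static String toRegex(String select) {
        String escapedDots = select.replace(".", "\\.");        // "." becomes "\."
        String expandedStars = escapedDots.replace("*", ".*");  // "*" becomes ".*"
        return "^" + expandedStars + "$";                       // anchor both ends
    }

    // Returns the file names that match the converted pattern.
    static List<String> filter(String select, List<String> fileNames) {
        Pattern pattern = Pattern.compile(toRegex(select));
        return fileNames.stream()
                .filter(name -> pattern.matcher(name).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("chap1.xml", "index.xhtml", "notes.txt", "old.xml.bak");
        System.out.println(toRegex("*.(xml|xhtml)"));        // prints ^.*\.(xml|xhtml)$
        System.out.println(filter("*.(xml|xhtml)", names));  // prints [chap1.xml, index.xhtml]
    }
}
```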
Note, however, that special characters used in the URL (that is, characters with a special meaning in regular expressions) may need to be escaped using the %HH convention. For example, vertical bar needs to be written as %7C. This escaping can be achieved using the iri-to-uri() function.
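When the collection URI is built in Java code rather than in XPath, the same %HH escaping can be applied by hand. The following is a sketch under stated assumptions: the class SelectEscaper is a hypothetical helper, and the set of characters it leaves unescaped is a choice made for this example, not a rule taken from Saxon or from iri-to-uri().

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for this illustration; not part of the Saxon API.
final class SelectEscaper {

    // Punctuation left unescaped, in addition to ASCII letters and digits.
    // This set is an assumption of the sketch.
    private static final String SAFE = "-._~*()!'";

    // Percent-encodes every UTF-8 byte of the pattern that is not in the safe set.
    static String escape(String pattern) {
        StringBuilder out = new StringBuilder();
        for (byte b : pattern.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alphanumeric = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alphanumeric || (c < 0x80 && SAFE.indexOf(c) >= 0)) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c)); // e.g. "|" becomes %7C
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints file:///a/b/c/d?select=*.(xml%7Cxhtml)
        System.out.println("file:///a/b/c/d?select=" + escape("*.(xml|xhtml)"));
    }
}
```

Only the parameter value is escaped here, so the "?", "=" and ";" delimiters of the query itself are left intact.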
A collection read in this way is not stable. Calling the collection() function again with the same URI will reprocess the directory and return a different set of document nodes, even if the contents of the directory have not changed.
Registered Collections
On the .NET product there is a third way to use a collection URI (provided that you use the API rather than the command line): you can register a collection using the Processor.RegisterCollection method on the Saxon.Api.Processor class.