Saxon implements the collection() function by passing the given URI (or null, if the default collection is requested) to a user-provided CollectionURIResolver.
This section describes how the standard collection resolver behaves when no user-written collection resolver is supplied.
The default collection resolver returns the empty sequence as the default collection. The only way of specifying a default collection is to provide your own CollectionURIResolver.
If a collection URI is provided, Saxon attempts to dereference it. What happens next depends on whether the URI identifies a file or a directory.
Using catalog files
If a file is identified, Saxon treats this as a catalog file. This is a file in XML format that lists the documents comprising the collection. Here is an example of such a catalog file:
<collection stable="true">
<doc href="dir/chap1.xml"/>
<doc href="dir/chap2.xml"/>
<doc href="dir/chap3.xml"/>
<doc href="dir/chap4.xml"/>
</collection>
The stable attribute indicates whether the collection is stable or not. The default value is true.
If a collection is stable, then the URIs listed in the doc elements are treated like URIs passed to the doc() function. Each URI is first looked up in the document pool to see if it is already loaded; if it is, then the document node is returned. Otherwise the URI is passed to the registered URIResolver, and the resulting document is added to the document pool. The effect of this process is, firstly, that two calls on the collection() function passing the same collection URI will return the same nodes each time, and secondly, that these results are consistent with the results of the doc() function: if the document-uri() of a node returned by the collection() function is passed to the doc() function, the original node will be returned.
If stable="false" is specified, however, the URI is dereferenced directly, and the document is not added to the document pool, which means that a subsequent retrieval of the same document will not return the same node.
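The difference between stable and non-stable retrieval can be illustrated with a toy model. The sketch below is not Saxon code: ToyDocumentPool, its fetch method, and the loader function are hypothetical names invented for this illustration, and a plain Object stands in for a document node. It assumes that a non-stable fetch bypasses the pool entirely, which is one reading of the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model only: a hypothetical stand-in for the document pool, not a Saxon class.
class ToyDocumentPool {
    // Maps a document URI to the node that was built for it.
    private final Map<String, Object> pool = new HashMap<>();
    // Stand-in for dereferencing and parsing a URI (assumption for this sketch).
    private final Function<String, Object> loader;

    ToyDocumentPool(Function<String, Object> loader) {
        this.loader = loader;
    }

    Object fetch(String uri, boolean stable) {
        if (!stable) {
            // Non-stable: dereference directly and do not remember the result.
            return loader.apply(uri);
        }
        // Stable: reuse the pooled node if present, otherwise load and pool it.
        return pool.computeIfAbsent(uri, loader);
    }
}
```

With a loader that builds a fresh object on every call, two stable fetches of dir/chap1.xml return the identical object, while two non-stable fetches return two distinct objects. This mirrors the node-identity behavior described above.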
Processing directories
If the URI passed to the collection() function (still assuming a default CollectionURIResolver) identifies a directory, then the contents of the directory are returned. Such a URI may have a number of query parameters, written in the form file:///a/b/c/d?keyword=value;keyword=value;...
The recognized keywords and their values are as follows:
| keyword | values | effect |
|---|---|---|
| recurse | yes, no (default no) | determines whether subdirectories are searched recursively |
| strip-space | yes, ignorable, no | determines whether whitespace text nodes are to be stripped. The default depends on the Configuration settings. |
| validation | strip, preserve, lax, strict | determines whether and how schema validation is applied to each document. The default depends on the Configuration settings. |
| select | file name pattern | determines which files are selected (see below) |
| on-error | fail, warning, ignore | determines the action to be taken if one of the documents cannot be successfully parsed |
| parser | Java class name | class name of the Java XMLReader to be used. For example, this can be used to select John Cowan's TagSoup parser. |
| xinclude | yes, no | determines whether XInclude processing should be applied to the selected documents. This overrides any setting made elsewhere. |
| unparsed | yes, no (default no) | determines whether the files contain unparsed text. If unparsed=yes is specified, the files are read as text using the platform default encoding. An error occurs if they contain characters that are not legal in XML. The parameters that are specific to XML, such as strip-space, parser, and validation, are ignored. The function returns a document node representing each file; the document node holds a single text node containing the file contents, and the document-uri() function returns the URI of the corresponding file. |
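To make the keyword=value;keyword=value syntax concrete, here is a minimal sketch in plain Java that splits such a URI into its parameters. It is an illustration only, not Saxon's own parsing code: the class name CollectionQuery and its parse method are hypothetical, and the sketch assumes that parameters are separated by semicolons exactly as shown above and that no further unescaping is needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Saxon API.
class CollectionQuery {
    // Splits the part after '?' into keyword/value pairs, keeping their order.
    static Map<String, String> parse(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int queryStart = uri.indexOf('?');
        if (queryStart < 0) {
            return params; // no query part: a plain directory URI
        }
        for (String pair : uri.substring(queryStart + 1).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                // Split at the first '=' only, so values may themselves contain '='.
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }
}
```

For example, parsing file:///a/b/c/d?recurse=yes;select=*.xml;on-error=warning yields three entries: recurse, select, and on-error, with values yes, *.xml, and warning.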
The pattern used in the select parameter can take the conventional form; for example, *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", and replacing "*" by ".*", and it is then used to match the file names appearing in the directory using the Java regular expression rules. So, for example, you can write ?select=*.(xml|xhtml) to match files with either of these two file extensions. Note, however, that special characters used in the URL (that is, characters with a special meaning in regular expressions) may need to be escaped using the %HH convention. For example, vertical bar needs to be written as %7C. This escaping can be achieved using the iri-to-uri() function.
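The conversion rules described above can be sketched in a few lines of Java. This is an illustration of the stated rules, not Saxon's actual implementation: the class SelectPattern and its methods are hypothetical names, and escapeBar is a hand-written stand-in that handles only the vertical bar (it does not reproduce iri-to-uri()).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Saxon API.
class SelectPattern {
    // Applies the documented rules: "." becomes "\.", "*" becomes ".*",
    // then "^" is prepended and "$" appended.
    static String toRegex(String select) {
        // Dots are escaped first so the dots introduced by ".*" stay unescaped.
        return "^" + select.replace(".", "\\.").replace("*", ".*") + "$";
    }

    // Tests a file name against the converted pattern using Java regex rules.
    static boolean matches(String select, String fileName) {
        return Pattern.compile(toRegex(select)).matcher(fileName).find();
    }

    // Writes the vertical bar in %HH form so it can appear in the URI.
    static String escapeBar(String select) {
        return select.replace("|", "%7C");
    }
}
```

Under these rules the pattern *.(xml|xhtml) becomes the regular expression ^.*\.(xml|xhtml)$, which accepts chap1.xml and index.xhtml but rejects notes.txt. In the URI itself the same pattern would be written as *.(xml%7Cxhtml).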
A collection read in this way is not stable. Calling the collection() function again with the same URI will reprocess the directory, and return a different set of document nodes, even if the contents of the directory have not changed.
Registered collections
On the .NET product there is a third way to use a collection URI (provided that you use the API rather than the command line): you can register a collection using the Processor.RegisterCollection method on the Saxon.Api.Processor class.