public class REMatcher
extends java.lang.Object
To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:
RE r = new RE("a*b");
Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:
boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".
If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:
RE r = new RE("(a*)b");                  // Compile expression
boolean matched = r.match("xaaaab");     // Match against "xaaaab"
String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1);     // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5
int startInside = r.getParenStart(1);    // startInside will be index 1
int endInside = r.getParenEnd(1);        // endInside will be index 5
int lenInside = r.getParenLength(1);     // lenInside will be 4
You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
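The backreference behaviour described above can be checked with a short sketch. It uses the JDK's java.util.regex package as a stand-in rather than the Saxon classes documented here, on the assumption that \1 behaves the same way for this simple pattern. The class and method names (BackreferenceSketch, isNEqualsN) are illustrative only.

```java
import java.util.regex.Pattern;

public class BackreferenceSketch {
    // The pattern ([0-9]+)=\1, compiled with java.util.regex (an assumption:
    // this is a stand-in, not the Saxon RECompiler).
    private static final Pattern N_EQUALS_N = Pattern.compile("([0-9]+)=\\1");

    /** Returns true if the whole input has the form n=n. */
    static boolean isNEqualsN(String input) {
        // matches() requires the entire input to match the pattern
        return N_EQUALS_N.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNEqualsN("0=0"));   // true
        System.out.println(isNEqualsN("12=12")); // true
        System.out.println(isNEqualsN("1=2"));   // false: \1 must repeat group 1 exactly
    }
}
```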
The full regular expression syntax accepted by RE is as defined in the XSD 1.1 specification, modified by the XPath 2.0 or 3.0 specifications.
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
RE runs programs compiled by the RECompiler class. But the RE matcher class does not include the actual regular expression compiler for reasons of efficiency. In fact, if you want to pre-compile one or more regular expressions, the 'recompile' class can be invoked from the command line to produce compiled output like this:
// Pre-compiled regular expression "a*b"
char[] re1Instructions = {
    0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
    0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
    0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
    0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
    0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from the pre-compiled expression re1 and thus avoid the overhead of compiling the expression at runtime. If you require more dynamic regular expressions, you can construct a single RECompiler object and re-use it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not thread-safe (for efficiency reasons, and because thread safety is deemed to be rarely required in these classes), so you will need to construct a separate compiler or matcher object for each thread unless you do thread synchronization yourself. Once an expression has been compiled into an REProgram object, that REProgram can be safely shared across multiple threads and RE objects.
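The sharing rule above (one immutable compiled program, one matcher per thread) can be sketched as follows. This is an analogy built on java.util.regex, where Pattern plays the role of the shareable REProgram and Matcher plays the role of the non-thread-safe matcher object. It does not use the Saxon classes, and the names SharedProgramSketch and countMatches are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedProgramSketch {
    // Compiled once and shared by all threads (stand-in for an REProgram).
    private static final Pattern SHARED = Pattern.compile("a*b");

    /** Counts the inputs that match in their entirety, one matcher per task. */
    static int countMatches(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (final String input : inputs) {
                Callable<Boolean> task = () -> {
                    // Matchers hold mutable state, so each task creates its own
                    Matcher m = SHARED.matcher(input);
                    return m.matches();
                };
                results.add(pool.submit(task));
            }
            int count = 0;
            for (Future<Boolean> f : results) {
                if (f.get()) {
                    count++;
                }
            }
            return count;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("matching task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // "aaaab" and "b" match a*b in full; "abc" and "xyz" do not
        System.out.println(countMatches(Arrays.asList("aaaab", "b", "abc", "xyz"))); // 2
    }
}
```

Creating the matcher inside the task, rather than sharing one across tasks, avoids the need for explicit synchronization.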
ISSUES:
This library is based on the Apache Jakarta regex library as downloaded on 3 January 2012. Changes have been made to make the grammar and semantics conform to XSD and XPath rules; these changes are listed in source code comments in the RECompiler source code module.
See Also: RECompiler
Modifier and Type | Class and Description |
---|---|
static class | REMatcher.State |
Constructor and Description |
---|
REMatcher(REProgram program) Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
Modifier and Type | Method and Description |
---|---|
boolean | anchoredMatch(UnicodeString search) Tests whether the regex matches a string in its entirety, anchored at both ends. |
REMatcher.State | captureState() |
protected void | clearCapturedGroupsBeyond(int pos) Clears any captured groups whose start position is at or beyond the specified position. |
UnicodeString | getParen(int which) Gets the contents of a parenthesized subexpression after a successful match. |
int | getParenCount() Returns the number of parenthesized subexpressions available after a successful match. |
int | getParenEnd(int which) Returns the end index of a given paren level. |
int | getParenStart(int which) Returns the start index of a given paren level. |
REProgram | getProgram() Returns the current regular expression program in use by this matcher object. |
boolean | match(java.lang.String search) Matches the current regular expression program against a String. |
boolean | match(UnicodeString search, int i) Matches the current regular expression program against a UnicodeString, starting at a given index. |
protected boolean | matchAt(int i, boolean anchored) Matches the current regular expression program against the current input string, starting at index i of the input string. |
java.lang.CharSequence | replace(UnicodeString in, UnicodeString replacement) Substitutes a string for this regular expression in another string. |
void | resetState(REMatcher.State state) |
protected void | setParenEnd(int which, int i) Sets the end of a paren level. |
protected void | setParenStart(int which, int i) Sets the start of a paren level. |
void | setProgram(REProgram program) Sets the current regular expression program used by this matcher object. |
java.util.List<UnicodeString> | split(UnicodeString s) Splits a string into a list of strings on regular expression boundaries. |
public REMatcher(REProgram program)
program - Compiled regular expression program
See Also: RECompiler
public void setProgram(REProgram program)
program - Regular expression program compiled by RECompiler.
See Also: RECompiler, REProgram
public REProgram getProgram()
See Also: setProgram(net.sf.saxon.regex.REProgram)
public int getParenCount()
public UnicodeString getParen(int which)
which - Nesting level of subexpression
public final int getParenStart(int which)
which - Nesting level of subexpression
public final int getParenEnd(int which)
which - Nesting level of subexpression
protected final void setParenStart(int which, int i)
which - Which paren level
i - Index in input array
protected final void setParenEnd(int which, int i)
which - Which paren level
i - Index in input array
protected void clearCapturedGroupsBeyond(int pos)
pos - the specified position
protected boolean matchAt(int i, boolean anchored)
i - The input string index to start matching at
anchored - true if the regex must match all characters up to the end of the string
public boolean anchoredMatch(UnicodeString search)
search - the string to be matched
public boolean match(UnicodeString search, int i)
search - String to match against
i - Index to start searching at
public boolean match(java.lang.String search)
search - String to match against
public java.util.List<UnicodeString> split(UnicodeString s)
Please note that the first string in the resulting list may be an empty string. This happens when the very first character of the input string is matched by the pattern.
s - String to split on this regular expression
public java.lang.CharSequence replace(UnicodeString in, UnicodeString replacement)
It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. Given a regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+", an input string (in) of "visit us: http://www.apache.org!" and the substitution string "<a href=\"$0\">$0</a>", the resulting string would be "visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".
Note: $0 represents the whole match.
in - String to substitute within
replacement - String to substitute for matches of this regular expression
public REMatcher.State captureState()
public void resetState(REMatcher.State state)
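The notes on split and replace above (a possible leading empty string, and $0 standing for the whole match) can be illustrated with a minimal sketch. It relies on java.util.regex as a stand-in and operates on plain String values rather than UnicodeString, so it shows the general behaviour only and may differ in detail from the Saxon methods. SplitReplaceSketch, splitOn and linkify are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitReplaceSketch {
    /** Splits input on the given pattern (stand-in for split). */
    static List<String> splitOn(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    /** Wraps each URL match in an anchor element; $0 is the whole match. */
    static String linkify(String input) {
        Pattern url = Pattern.compile("http://[\\.\\w\\-\\?/~_@&=%]+");
        return url.matcher(input).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // The input starts with a separator, so the first element is empty
        System.out.println(splitOn(",", ",a,b")); // [, a, b]

        // Both occurrences of $0 are replaced by the matched URL
        System.out.println(linkify("visit us: http://www.apache.org!"));
    }
}
```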
Copyright (c) 2004-2020 Saxonica Limited. All rights reserved.