Package net.sf.saxon.str
Class ZenoString
- java.lang.Object
-
- net.sf.saxon.str.UnicodeString
-
- net.sf.saxon.str.ZenoString
-
- All Implemented Interfaces:
java.lang.Comparable<UnicodeString>, AtomicMatchKey
public class ZenoString extends UnicodeString
A ZenoString is an implementation of UnicodeString that comprises a list of segments representing substrings of the total string. By convention the segments are not themselves ZenoStrings, so the structure is a shallow tree. An index holds pointers to the segments and their offsets within the string as a whole; this is used to locate the codepoint at any particular location in the string.
The segments will always be non-empty. An empty string contains no segments.
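The role of the index can be illustrated with a minimal sketch. This is not Saxon's implementation: the class SegmentIndexSketch, its fields, and the use of int[] arrays of code points as segments are assumptions made for the example. It records the start offset of each segment and uses a binary search to find the segment that holds a given position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a segment list with an offset index; NOT the Saxon ZenoString class.
final class SegmentIndexSketch {
    private final List<int[]> segments = new ArrayList<>(); // every stored segment is non-empty
    private final List<Long> offsets = new ArrayList<>();   // start offset of each segment
    private long length = 0;

    // Append a segment; empty input is ignored so that no empty segment is ever stored.
    void addSegment(int[] codePoints) {
        if (codePoints.length == 0) {
            return;
        }
        segments.add(codePoints);
        offsets.add(length);
        length += codePoints.length;
    }

    long length() {
        return length;
    }

    int segmentCount() {
        return segments.size();
    }

    // Binary search for the last segment whose start offset is <= index,
    // then read the code point at the relative position inside that segment.
    int codePointAt(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        int lo = 0;
        int hi = segments.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (offsets.get(mid) <= index) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return segments.get(lo)[(int) (index - offsets.get(lo))];
    }

    public static void main(String[] args) {
        SegmentIndexSketch s = new SegmentIndexSketch();
        s.addSegment("Zeno".codePoints().toArray());
        s.addSegment("String".codePoints().toArray());
        System.out.println((char) s.codePointAt(4));
    }
}
```

With a logarithmic number of segments, a lookup of this kind touches only a handful of index entries even for very long strings.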
The key to the performance of the data structure (and its name) is the algorithm for consolidating segments when strings are concatenated, so as to keep the number of segments increasing logarithmically with the string size, with short segments at the extremities to allow efficient further concatenation at the ends.
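To make the idea of logarithmic growth concrete, here is a deliberately simplified sketch. It does not reproduce the consolidation rule used by ZenoString (the Balisage paper is the place to look for that). Instead it swaps in a simpler binary-counter style rule that only handles appending at the right-hand end: after each append, the last two segments are merged while the earlier one is no longer than the later one. Unlike ZenoString, this rule does not keep short segments at both extremities. The class and method names are hypothetical, and segments are represented only by their lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a binary-counter merge rule, NOT Saxon's actual consolidation algorithm.
final class ConsolidationSketch {
    private final List<Long> segmentLengths = new ArrayList<>();

    // Append a segment of the given length, then merge from the right-hand end
    // while the second-to-last segment is no longer than the last one.
    void append(long newSegmentLength) {
        if (newSegmentLength <= 0) {
            return; // segments are never empty
        }
        segmentLengths.add(newSegmentLength);
        int n = segmentLengths.size();
        while (n >= 2 && segmentLengths.get(n - 2) <= segmentLengths.get(n - 1)) {
            long merged = segmentLengths.remove(n - 1) + segmentLengths.remove(n - 2);
            segmentLengths.add(merged);
            n = segmentLengths.size();
        }
    }

    // Return a copy of the current segment lengths, left to right.
    List<Long> segmentLengths() {
        return new ArrayList<>(segmentLengths);
    }

    public static void main(String[] args) {
        ConsolidationSketch c = new ConsolidationSketch();
        for (int i = 0; i < 1000; i++) {
            c.append(1);
        }
        System.out.println(c.segmentLengths());
    }
}
```

Under this simplified rule, appending n single-codepoint segments leaves one segment per 1-bit in the binary representation of n, so the segment count stays at or below log2(n) + 1.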
For further details see the paper by Michael Kay at Balisage 2021.
-
-
Field Summary
Fields
Modifier and Type, Field, Description
static ZenoString EMPTY
    An empty ZenoString.
-
Method Summary
Modifier and Type, Method, Description
int codePointAt(long index)
    Get the code point at a given position in the string.
IntIterator codePoints()
    Get an iterator over the code points present in the string.
ZenoString concat(UnicodeString other)
    Concatenate another string.
static UnicodeString concatSegments(UnicodeString left, UnicodeString right)
java.util.List<java.lang.Long> debugSegmentLengths()
    This method is for diagnostics and unit testing only: it exposes the lengths of the internal segments.
UnicodeString economize()
    Get an equivalent UnicodeString that uses the most economical representation available.
int getWidth()
    Get the number of bits needed to hold all the characters in this string.
long indexOf(int codePoint, long from)
    Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string.
long indexWhere(java.util.function.IntPredicate predicate, long from)
    Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string.
boolean isEmpty()
    Ask whether the string is empty.
long length()
    Get the length of the string.
static ZenoString of(UnicodeString content)
    Construct a ZenoString from a supplied UnicodeString.
UnicodeString substring(long start, long end)
    Get a substring of this codepoint sequence, with a given start and end position.
java.lang.String toString()
void writeSegments(UnicodeWriter writer)
    Write each of the segments in turn to a UnicodeWriter.
Methods inherited from class net.sf.saxon.str.UnicodeString
asAtomic, checkSubstringBounds, compareTo, equals, estimatedLength, hashCode, hasSubstring, indexOf, indexOf, length32, prefix, requireInt, substring, tidy, verifyCharacters
-
-
-
-
Field Detail
-
EMPTY
public static final ZenoString EMPTY
An empty ZenoString
-
-
Method Detail
-
of
public static ZenoString of(UnicodeString content)
Construct a ZenoString from a supplied UnicodeString.
- Parameters:
content - the supplied UnicodeString
- Returns:
- the resulting ZenoString
-
codePoints
public IntIterator codePoints()
Get an iterator over the code points present in the string.
- Specified by:
codePoints in class UnicodeString
- Returns:
- an iterator that delivers the individual code points
-
length
public long length()
Get the length of the string.
- Specified by:
length in class UnicodeString
- Returns:
- the number of code points in the string
-
isEmpty
public boolean isEmpty()
Ask whether the string is empty.
- Overrides:
isEmpty in class UnicodeString
- Returns:
- true if the length of the string is zero
-
getWidth
public int getWidth()
Get the number of bits needed to hold all the characters in this string.
- Specified by:
getWidth in class UnicodeString
- Returns:
- 7 for ASCII characters, 8 for Latin-1, 16 for BMP, 24 for general Unicode.
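The four width values correspond to code point ranges, which the following sketch makes explicit. It is an illustration only: the class WidthSketch and the method widthOf are hypothetical, the input is a java.lang.String rather than a UnicodeString, and returning 7 for an empty string is an assumption. How ZenoString itself derives the value is not stated here.

```java
// Hypothetical helper; not part of Saxon.
final class WidthSketch {
    // Classify a string by its highest code point.
    static int widthOf(String s) {
        int max = s.codePoints().max().orElse(0); // assumption: empty string counts as ASCII
        if (max < 0x80) {
            return 7;   // ASCII
        }
        if (max < 0x100) {
            return 8;   // Latin-1
        }
        if (max < 0x10000) {
            return 16;  // Basic Multilingual Plane
        }
        return 24;      // supplementary planes
    }

    public static void main(String[] args) {
        System.out.println(widthOf("abc"));
        System.out.println(widthOf("caf\u00e9"));
    }
}
```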
-
indexOf
public long indexOf(int codePoint, long from)
Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string.
- Specified by:
indexOf in class UnicodeString
- Parameters:
codePoint - the sought codePoint
from - the position from which the search should start (0-based), in the range 0 to length()-1
- Returns:
- the position (0-based) of the first occurrence found, or -1 if not found
- Throws:
java.lang.IndexOutOfBoundsException - if the from value is out of range
-
indexWhere
public long indexWhere(java.util.function.IntPredicate predicate, long from)
Description copied from class: UnicodeString
Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string.
- Overrides:
indexWhere in class UnicodeString
- Parameters:
predicate - condition that the codepoint must satisfy
from - the position from which the search should start (0-based)
- Returns:
- the position (0-based) of the first codepoint to match the predicate, or -1 if not found
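The contract can be sketched over a plain array of code points. This is an assumption-laden illustration, not the Saxon code: the class IndexWhereSketch is hypothetical, and its treatment of an out-of-range from value (a negative value is treated as 0, a value past the end yields -1) is a choice made for the example, since the behaviour in that case is not documented above.

```java
import java.util.function.IntPredicate;

// Hypothetical helper; not part of Saxon.
final class IndexWhereSketch {
    // Return the 0-based position of the first code point at or after 'from'
    // that satisfies the predicate, or -1 if there is none.
    static long indexWhere(int[] codePoints, IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < codePoints.length; i++) {
            if (predicate.test(codePoints[(int) i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] cps = "ab 12".codePoints().toArray();
        System.out.println(indexWhere(cps, Character::isDigit, 0));
    }
}
```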
-
codePointAt
public int codePointAt(long index)
Get the code point at a given position in the string.
- Specified by:
codePointAt in class UnicodeString
- Parameters:
index - the given position (0-based)
- Returns:
- the code point at the given position
- Throws:
java.lang.IndexOutOfBoundsException - if the index is out of range
-
substring
public UnicodeString substring(long start, long end)
Get a substring of this codepoint sequence, with a given start and end position.
- Specified by:
substring in class UnicodeString
- Parameters:
start - the start position (0-based): that is, the position of the first code point to be included
end - the end position (0-based): specifically, the position of the first code point not to be included
- Returns:
- the requested substring
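Because start and end count code points, not UTF-16 char units, the positions differ from those of java.lang.String.substring whenever the text contains characters outside the BMP. The sketch below shows the same half-open convention applied to a java.lang.String. The class CodePointSubstringSketch is hypothetical and is not how Saxon implements the method.

```java
// Hypothetical helper using java.lang.String; not part of Saxon.
final class CodePointSubstringSketch {
    // Return the code points in the half-open range [start, end),
    // converting code point positions to UTF-16 offsets first.
    static String substring(String s, int start, int end) {
        int from = s.offsetByCodePoints(0, start);
        int to = s.offsetByCodePoints(from, end - start);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00bc"; // 4 code points, 5 UTF-16 chars
        System.out.println(substring(s, 1, 3));
    }
}
```

The result always contains end - start code points, so substring(1, 3) above yields the emoji followed by "b".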
-
concat
public ZenoString concat(UnicodeString other)
Concatenate another string.
- Overrides:
concat in class UnicodeString
- Parameters:
other - the string to be appended to this one
- Returns:
- the result of the concatenation (neither input string is altered)
-
writeSegments
public void writeSegments(UnicodeWriter writer) throws java.io.IOException
Write each of the segments in turn to a UnicodeWriter.
- Parameters:
writer - the writer to which the string is to be written
- Throws:
java.io.IOException
-
concatSegments
public static UnicodeString concatSegments(UnicodeString left, UnicodeString right)
-
economize
public UnicodeString economize()
Get an equivalent UnicodeString that uses the most economical representation available.
- Overrides:
economize in class UnicodeString
- Returns:
- an equivalent UnicodeString
-
toString
public java.lang.String toString()
- Overrides:
toString in class java.lang.Object
-
debugSegmentLengths
public java.util.List<java.lang.Long> debugSegmentLengths()
This method is for diagnostics and unit testing only: it exposes the lengths of the internal segments. This is an implementation detail that is subject to change and does not affect the exposed functionality.
- Returns:
- the lengths of the segments
-
-