Package net.sf.saxon.str
Class UnicodeString
java.lang.Object
  net.sf.saxon.str.UnicodeString
- All Implemented Interfaces:
java.lang.Comparable<UnicodeString>, AtomicMatchKey
- Direct Known Subclasses:
BMPString, EmptyUnicodeString, Slice16, Slice24, Slice8, StringView, Twine16, Twine24, Twine8, UnicodeChar, WhitespaceString, ZenoString
public abstract class UnicodeString extends java.lang.Object implements AtomicMatchKey, java.lang.Comparable<UnicodeString>
A UnicodeString is a sequence of Unicode codepoints that supports codepoint addressing. The interface is future-proofed to support code points in the range 0 to 2^31, and string lengths of up to 2^63 characters. Implementations may (and do) impose lower limits.
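To illustrate what codepoint addressing means, the following minimal sketch contrasts it with java.lang.String, whose indexes count UTF-16 code units. It uses only the JDK; the class name CodepointAddressingSketch and its helper are hypothetical and are not part of Saxon.

```java
// Hypothetical illustration using only the JDK; this is not Saxon code.
final class CodepointAddressingSketch {

    // Expand a Java String into one int per Unicode code point.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', then U+1F600, then 'b'
        int[] cps = toCodePoints(s);
        // String.length() counts UTF-16 code units; the array counts code points.
        System.out.println(s.length() + " UTF-16 units, " + cps.length + " code points");
        // Position 2 is 'b' when counting code points, but a low surrogate in UTF-16.
        System.out.println(Integer.toHexString(cps[2]) + " vs " + Integer.toHexString(s.charAt(2)));
    }
}
```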
-
-
Constructor Summary
Constructor and Description:
UnicodeString()
-
Method Summary
Modifier and Type, Method, Description:
AtomicValue asAtomic()
- Get an atomic value that encapsulates this match key.
protected void checkSubstringBounds(long start, long end)
abstract int codePointAt(long index)
- Get the code point at a given position in the string.
abstract IntIterator codePoints()
- Get an iterator over the code points present in the string.
int compareTo(UnicodeString other)
- Compare this string to another using codepoint comparison.
UnicodeString concat(UnicodeString other)
- Concatenate with another string, returning a new string.
UnicodeString economize()
boolean equals(java.lang.Object obj)
long estimatedLength()
- Get the estimated length of the string, suitable for space allocation.
abstract int getWidth()
- Get the number of bits needed to hold all the characters in this string.
int hashCode()
- Compute a hashCode.
boolean hasSubstring(UnicodeString other, long offset)
- Ask whether this string has another string as its content starting at a given offset.
long indexOf(int codePoint)
- Get the position of the first occurrence of the specified codepoint, starting the search at the beginning.
abstract long indexOf(int codePoint, long from)
- Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string.
long indexOf(UnicodeString other, long from)
- Get the first position, at or beyond from, where another string appears as a substring of this string, comparing codepoints.
long indexWhere(java.util.function.IntPredicate predicate, long from)
- Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string.
boolean isEmpty()
- Ask whether the string is empty.
abstract long length()
- Get the length of the string.
int length32()
- Get the length of the string, provided it is less than 2^31 characters.
UnicodeString prefix(long end)
- Get a substring of this string, starting at position 0, with a given end position.
static int requireInt(long value)
- Utility method for use where strings longer than 2^31 characters cannot yet be handled.
UnicodeString substring(long start)
- Get a substring of this codepoint sequence, with a given start position, finishing at the end of the string.
abstract UnicodeString substring(long start, long end)
- Get a substring of this string, with a given start and end position.
UnicodeString tidy()
- Ensure that the implementation is capable of counting codepoints in the string.
void verifyCharacters()
- Diagnostic method: verify that all the characters in the string are valid XML codepoints.
-
-
-
Method Detail
-
tidy
public UnicodeString tidy()
Ensure that the implementation is capable of counting codepoints in the string. This is normally a null operation, but it may cause internal reorganisation.
- Returns:
- this UnicodeString, or another that represents the same sequence of characters.
-
economize
public UnicodeString economize()
-
length
public abstract long length()
Get the length of the string.
- Returns:
- the number of code points in the string
-
length32
public int length32()
Get the length of the string, provided it is less than 2^31 characters.
- Returns:
- the length of the string if it fits within a Java int
- Throws:
java.lang.UnsupportedOperationException
- if the string is longer than 2^31 characters
-
estimatedLength
public long estimatedLength()
Get the estimated length of the string, suitable for space allocation.
- Returns:
- for a UnicodeString, the actual length of the string in codepoints
-
isEmpty
public boolean isEmpty()
Ask whether the string is empty.
- Returns:
- true if the length of the string is zero
-
getWidth
public abstract int getWidth()
Get the number of bits needed to hold all the characters in this string.
- Returns:
- 7 for ASCII characters (this value may not be used in practice), 8 for Latin-1, 16 for BMP, 24 for general Unicode.
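As a rough illustration of how these widths relate to code point ranges, the sketch below classifies an array of code points. WidthSketch and widthOf are hypothetical names, not Saxon's implementation; the sketch assumes the 7-bit case is never reported and returns 8 for an empty array, which may differ from what Saxon does.

```java
// Hypothetical helper, not part of Saxon: classify code points by storage width.
final class WidthSketch {

    // Returns 8 if every code point fits in Latin-1, 16 if every one lies in
    // the BMP, and 24 otherwise. The 7-bit ASCII case is deliberately omitted.
    static int widthOf(int[] codePoints) {
        int max = 0;
        for (int cp : codePoints) {
            max = Math.max(max, cp);
        }
        if (max <= 0xFF) {
            return 8;
        }
        if (max <= 0xFFFF) {
            return 16;
        }
        return 24;
    }

    public static void main(String[] args) {
        System.out.println(widthOf("caf\u00E9".codePoints().toArray()));
        System.out.println(widthOf("\u4E2D\u6587".codePoints().toArray()));
        System.out.println(widthOf("\uD83D\uDE00".codePoints().toArray()));
    }
}
```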
-
indexOf
public long indexOf(int codePoint)
Get the position of the first occurrence of the specified codepoint, starting the search at the beginning.
- Parameters:
codePoint
- the sought codePoint
- Returns:
- the position (0-based) of the first occurrence found, or -1 if not found, counting codePoints rather than UTF-16 chars.
- Throws:
java.lang.UnsupportedOperationException
- if the UnicodeString has not been prepared for codePoint access
-
indexOf
public abstract long indexOf(int codePoint, long from)
Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string.
- Parameters:
codePoint
- the sought codePoint
from
- the position from which the search should start (0-based)
- Returns:
- the position (0-based) of the first occurrence found, or -1 if not found
- Throws:
java.lang.UnsupportedOperationException
- if the UnicodeString has not been prepared for codePoint access
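The difference between a position counted in code points and one counted in UTF-16 chars can be seen in a small sketch over an int array. IndexOfSketch is a hypothetical class, not Saxon code; treating a negative from as 0 is an assumption of the sketch, not documented behaviour.

```java
// Hypothetical sketch, not Saxon code: linear search over an array of code points.
final class IndexOfSketch {

    // Returns the 0-based code point position of the first match at or after
    // 'from', or -1 if there is none. A negative 'from' is treated as 0 (assumption).
    static long indexOf(int[] codePoints, int codePoint, long from) {
        for (long i = Math.max(from, 0); i < codePoints.length; i++) {
            if (codePoints[(int) i] == codePoint) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00ab"; // U+1F600, then 'a', then 'b'
        // The two results differ because the first character needs two UTF-16 chars.
        System.out.println(indexOf(s.codePoints().toArray(), 'b', 0));
        System.out.println(s.indexOf('b'));
    }
}
```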
-
indexWhere
public long indexWhere(java.util.function.IntPredicate predicate, long from)
Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string.
- Parameters:
predicate
- condition that the codepoint must satisfy
from
- the position from which the search should start (0-based)
- Returns:
- the position (0-based) of the first codepoint to match the predicate, or -1 if not found
- Throws:
java.lang.UnsupportedOperationException
- if the UnicodeString has not been prepared for codePoint access
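A predicate-driven search of this kind might look like the following sketch, which works on a plain int array rather than a UnicodeString. IndexWhereSketch is hypothetical, and clamping a negative from to 0 is an assumption rather than documented behaviour.

```java
// Hypothetical sketch, not Saxon code: find the first code point matching a predicate.
final class IndexWhereSketch {

    // Returns the 0-based position of the first code point at or after 'from'
    // that satisfies the predicate, or -1 if none does.
    static long indexWhere(int[] codePoints, java.util.function.IntPredicate predicate, long from) {
        for (long i = Math.max(from, 0); i < codePoints.length; i++) {
            if (predicate.test(codePoints[(int) i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] cps = "ab 1c".codePoints().toArray();
        // Character.isDigit(int) and Character.isWhitespace(int) accept code points.
        System.out.println(indexWhere(cps, Character::isDigit, 0));
        System.out.println(indexWhere(cps, Character::isWhitespace, 0));
    }
}
```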
-
indexOf
public long indexOf(UnicodeString other, long from)
Get the first position, at or beyond from, where another string appears as a substring of this string, comparing codepoints.
- Parameters:
other
- the other (sought) string
from
- the position (0-based) where searching is to start (counting in codepoints)
- Returns:
- the first position where the substring is found, or -1 if it is not found
-
hasSubstring
public boolean hasSubstring(UnicodeString other, long offset)
Ask whether this string has another string as its content starting at a given offset.
- Parameters:
other
- the other string
offset
- the starting position in this string (counting in codepoints)
- Returns:
- true if the other string appears as a substring of this string starting at the given position.
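The test described here amounts to a region match at a fixed offset. The sketch below shows one way to express it over int arrays; HasSubstringSketch is hypothetical, and returning false for an offset that is negative or too close to the end is an assumption, since the documentation does not say how such offsets are handled.

```java
// Hypothetical sketch, not Saxon code: region match at a fixed code point offset.
final class HasSubstringSketch {

    // Returns true if 'other' occurs in 'text' starting exactly at 'offset'.
    // Out-of-range offsets yield false rather than an exception (assumption).
    static boolean hasSubstring(int[] text, int[] other, long offset) {
        if (offset < 0 || offset + other.length > text.length) {
            return false;
        }
        for (int i = 0; i < other.length; i++) {
            if (text[(int) offset + i] != other[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] text = "x\uD83D\uDE00yz".codePoints().toArray();
        int[] other = "yz".codePoints().toArray();
        // The offset counts code points, so "yz" starts at 2, not 3.
        System.out.println(hasSubstring(text, other, 2));
    }
}
```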
-
codePoints
public abstract IntIterator codePoints()
Get an iterator over the code points present in the string.
- Returns:
- an iterator that delivers the individual code points
-
codePointAt
public abstract int codePointAt(long index)
Get the code point at a given position in the string.
- Parameters:
index
- the given position (0-based)
- Returns:
- the code point at the given position
- Throws:
java.lang.IndexOutOfBoundsException
- if the index is out of range
-
substring
public UnicodeString substring(long start)
Get a substring of this codepoint sequence, with a given start position, finishing at the end of the string.
- Parameters:
start
- the start position (0-based): that is, the position of the first code point to be included
- Returns:
- the requested substring
- Throws:
java.lang.IndexOutOfBoundsException
- if the start position is out of range
-
substring
public abstract UnicodeString substring(long start, long end)
Get a substring of this string, with a given start and end position.
- Parameters:
start
- the start position (0-based): that is, the position of the first code point to be included
end
- the end position (0-based): specifically, the position of the first code point not to be included
- Returns:
- the requested substring
- Throws:
java.lang.IndexOutOfBoundsException
- if the start/end positions are out of range (the conditions are the same as for String.substring())
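Since the bounds conditions are stated to match String.substring(), they can be sketched as follows: the call is rejected when start is negative, end exceeds the length, or start exceeds end. SubstringBoundsSketch and its methods are hypothetical and operate on an int array; this is not Saxon's checkSubstringBounds.

```java
// Hypothetical sketch, not Saxon code: String.substring()-style bounds on code points.
final class SubstringBoundsSketch {

    // Throws if the range is not 0 <= start <= end <= length.
    static void checkBounds(long start, long end, long length) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(
                    "start=" + start + ", end=" + end + ", length=" + length);
        }
    }

    // Copies the code points from 'start' (inclusive) to 'end' (exclusive).
    static int[] substring(int[] codePoints, long start, long end) {
        checkBounds(start, end, codePoints.length);
        return java.util.Arrays.copyOfRange(codePoints, (int) start, (int) end);
    }

    public static void main(String[] args) {
        int[] cps = "a\uD83D\uDE00bc".codePoints().toArray();
        int[] sub = substring(cps, 1, 3);
        System.out.println(new String(sub, 0, sub.length));
    }
}
```

An empty range such as start == end == length is accepted, as it is for String.substring().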
-
prefix
public UnicodeString prefix(long end)
Get a substring of this string, starting at position 0, with a given end position.
- Parameters:
end
- the end position (0-based): specifically, the position of the first code point not to be included
- Returns:
- the requested substring
- Throws:
java.lang.IndexOutOfBoundsException
- if the end position is out of range
-
concat
public UnicodeString concat(UnicodeString other)
Concatenate with another string, returning a new string.
- Parameters:
other
- the string to be appended
- Returns:
- the result of concatenating this string followed by the other
-
checkSubstringBounds
protected void checkSubstringBounds(long start, long end)
-
verifyCharacters
public void verifyCharacters()
Diagnostic method: verify that all the characters in the string are valid XML codepoints.
- Throws:
java.lang.IllegalStateException
- if the contents are invalid
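The documentation does not say which XML version defines validity here. Assuming the Char production of XML 1.0, a check of this kind could be sketched as below; XmlCharSketch and its methods are hypothetical and are not Saxon's implementation.

```java
// Hypothetical sketch, not Saxon code. Assumes the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
final class XmlCharSketch {

    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Throws IllegalStateException at the first code point that is not a valid XML character.
    static void verify(int[] codePoints) {
        for (int i = 0; i < codePoints.length; i++) {
            if (!isValidXml10(codePoints[i])) {
                throw new IllegalStateException("Invalid XML character U+"
                        + Integer.toHexString(codePoints[i]).toUpperCase()
                        + " at position " + i);
            }
        }
    }

    public static void main(String[] args) {
        verify("plain text".codePoints().toArray());
        System.out.println("verified");
    }
}
```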
-
equals
public boolean equals(java.lang.Object obj)
- Overrides:
equals in class java.lang.Object
-
hashCode
public int hashCode()
Compute a hashCode. All implementations of UnicodeString use compatible hash codes and the hashing algorithm is therefore identical to that for java.lang.String. This means that for strings containing Astral characters, the hash code needs to be computed by decomposing an Astral character into a surrogate pair.
- Overrides:
hashCode in class java.lang.Object
- Returns:
- the hash code
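The surrogate-pair decomposition can be sketched as follows. The loop applies the java.lang.String recurrence h = 31 * h + c to each UTF-16 code unit, splitting any code point above U+FFFF into its high and low surrogates first. HashSketch is a hypothetical class written for illustration, not Saxon's code.

```java
// Hypothetical sketch, not Saxon code: a String-compatible hash over code points.
final class HashSketch {

    static int hash(int[] codePoints) {
        int h = 0;
        for (int cp : codePoints) {
            if (cp < 0x10000) {
                // BMP code point: a single UTF-16 code unit.
                h = 31 * h + cp;
            } else {
                // Astral code point: hash the two surrogates in order.
                h = 31 * h + Character.highSurrogate(cp);
                h = 31 * h + Character.lowSurrogate(cp);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        System.out.println(hash(s.codePoints().toArray()) == s.hashCode());
    }
}
```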
-
compareTo
public int compareTo(UnicodeString other)
Compare this string to another using codepoint comparison.
- Specified by:
compareTo in interface java.lang.Comparable<UnicodeString>
- Parameters:
other
- the other string
- Returns:
- -1 if this string comes first, 0 if they are equal, +1 if the other string comes first
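Codepoint comparison can give a different order from String.compareTo, which compares UTF-16 code units: a surrogate (0xD800 to 0xDFFF) sorts below BMP characters in the range U+E000 to U+FFFF even though the astral character it encodes has a higher code point. The sketch below uses a hypothetical class, CodepointCompareSketch, over int arrays; it is not Saxon's implementation.

```java
// Hypothetical sketch, not Saxon code: lexicographic comparison by code point.
final class CodepointCompareSketch {

    // Returns -1, 0 or +1, comparing code point by code point and then by length.
    static int compare(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        if (a.length == b.length) {
            return 0;
        }
        return a.length < b.length ? -1 : 1;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";          // U+FF5E, a BMP character
        String astral = "\uD83D\uDE00"; // U+1F600, an astral character
        // Codepoint order puts U+FF5E first; UTF-16 order puts the surrogate first.
        System.out.println(compare(bmp.codePoints().toArray(), astral.codePoints().toArray()));
        System.out.println(Integer.signum(bmp.compareTo(astral)));
    }
}
```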
-
asAtomic
public AtomicValue asAtomic()
Get an atomic value that encapsulates this match key. Needed to support the collation-key() function.
- Specified by:
asAtomic in interface AtomicMatchKey
- Returns:
- an atomic value that encapsulates this match key
-
requireInt
public static int requireInt(long value)
Utility method for use where strings longer than 2^31 characters cannot yet be handled.
- Parameters:
value
- the actual value of a character position within a string, or the length of a string
- Returns:
- the value as an integer if it is within range
- Throws:
java.lang.UnsupportedOperationException
- if the supplied value exceeds Integer.MAX_VALUE
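The documented contract is small enough to sketch directly: narrow a long to an int, refusing values above Integer.MAX_VALUE. RequireIntSketch is a hypothetical stand-in, not Saxon's source; the exception message is invented, and negative values are passed through unchecked because the documentation only mentions the upper limit.

```java
// Hypothetical sketch, not Saxon code: a guarded long-to-int narrowing.
final class RequireIntSketch {

    static int requireInt(long value) {
        if (value > Integer.MAX_VALUE) {
            // Message text is illustrative only.
            throw new UnsupportedOperationException("Value exceeds Integer.MAX_VALUE: " + value);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(requireInt(42L));
        try {
            requireInt(1L << 31);
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```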
-
-