java.lang.Object
  org.hd.d.pg2k.svrCore.Compact7BitString
public class Compact7BitString
Compact immutable in-memory and on-the-wire representation of a (short) 7-bit ASCII String. Converting to and from this form should be relatively resource-light, eg low CPU and memory requirements.
Where there is a conflict between memory consumption and CPU, this implementation aims to minimise memory use at the possible expense of CPU. Thus this class is probably inefficient to use as a key in a hashed data structure where the hash is used heavily, since we do not cache the hash even though it may be expensive to compute.
This class is instance-controlled, and uses intern() to avoid duplicate in-memory copies.
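As a rough illustration of what "instance-controlled" can mean in practice, the sketch below keeps a pool of canonical instances keyed by content, so that equal byte sequences resolve to the same object. The class name `InternedBytes`, the factory `of` and the `ConcurrentHashMap` pool are all hypothetical; the real class's intern() mechanism is not documented here and may differ, for example by using weak references so that unused entries can be collected.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical instance-controlled value type; NOT the real Compact7BitString. */
final class InternedBytes {
    /** Single shared instance for the empty value. */
    static final InternedBytes EMPTY = new InternedBytes(null);

    /** Assumed pool of canonical instances (strong references, never cleared). */
    private static final ConcurrentMap<InternedBytes, InternedBytes> POOL =
        new ConcurrentHashMap<>();

    /** One byte per character; null (not an empty array) for the empty value. */
    private final byte[] text;

    private InternedBytes(final byte[] text) { this.text = text; }

    /** Return the canonical instance for these bytes; never null. */
    static InternedBytes of(final byte[] raw) {
        if (raw == null || raw.length == 0) { return EMPTY; }
        // Defensive copy so that later changes to the caller's array cannot leak in.
        final InternedBytes candidate = new InternedBytes(raw.clone());
        final InternedBytes existing = POOL.putIfAbsent(candidate, candidate);
        return (existing != null) ? existing : candidate;
    }

    @Override public boolean equals(final Object o) {
        return (o instanceof InternedBytes)
            && Arrays.equals(text, ((InternedBytes) o).text);
    }

    /** Recomputed on every call; no cached hash field is held. */
    @Override public int hashCode() { return Arrays.hashCode(text); }

    public static void main(final String[] args) {
        final InternedBytes a = of(new byte[] {'h', 'i'});
        final InternedBytes b = of(new byte[] {'h', 'i'});
        System.out.println(a == b);
    }
}
```

With a pool like this, identity comparison (`==`) is enough for interned values, at the cost of a lookup on every creation.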
This cannot hold text longer than Short.MAX_VALUE characters.
Internally this uses a byte[] to store the data (a basic halving of in-memory size in most JVMs) with some tokenising for further compaction. No attempt is made to share state with other instances in memory (unless a shared StaticDictionary is used).
An instance of Compact7BitString will serialise itself as a String, rather than as the usual Compact7BitString, if that seems likely to be more efficient, compressible or safe: (a) given the special handling of String, (b) for global compressibility, or (c) if the value cannot be encoded other than as a String instance.
Any serialisable field that holds a Compact7BitString value should therefore be declared as type Object or Serializable, so that it can be read back as either.
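The effect on field types can be seen in a minimal, self-contained sketch using standard Java serialisation. `CompactText` and `Holder` are hypothetical stand-ins, not classes from this package, and this `writeReplace()` always substitutes a String, whereas the real class decides case by case.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a value class that serialises itself as a String. */
final class CompactText implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String value;
    CompactText(final String value) { this.value = value; }
    @Override public String toString() { return value; }
    /** Assumption for this sketch: always substitute the plain String on the wire. */
    private Object writeReplace() { return value; }
}

/** Hypothetical holder: the field is typed Object so either form can be read back. */
final class Holder implements Serializable {
    private static final long serialVersionUID = 1L;
    final Object text;
    Holder(final Object text) { this.text = text; }
}

final class SerialDemo {
    /** Serialise to a byte[] and read straight back, wrapping checked exceptions. */
    static Object roundTrip(final Object o) {
        try {
            final ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(final String[] args) {
        final Holder back = (Holder) roundTrip(new Holder(new CompactText("abc")));
        // The field went out as a CompactText but comes back as a String.
        System.out.println(back.text.getClass().getName() + ": " + back.text);
    }
}
```

Had `Holder.text` been declared as `CompactText`, deserialisation would fail when the substituted String is assigned to the field, which is why the wider field type is needed.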
This is not publicly constructible, and may have specialised sub-classes to support some features, while keeping the base class as small as possible.
| Nested Class Summary | |
|---|---|
| static class | Compact7BitString.StaticDictionary - Immutable static dictionary to improve in-memory compression. |
| private static class | Compact7BitString.WithDict - Specialised immutable sub-class to allow use with a StaticDictionary. |
| private static class | Compact7BitString.WordChars - Boolean true/false flags for what are considered "word" chars; initialised on first use. |
| Field Summary | |
|---|---|
| private static int | _FUDGE_FACTOR - Margin that byte[] must be smaller than String.length() for this to be smaller on the wire. |
| private static boolean | ALLOW_1_CHAR_MTM - If true, allow us to include 1-char tokens in multi-token matches (MTMs). |
| private static boolean | ALLOW_MULTI_TOKEN_MATCHING - If true then allow multi-token matches (MTM) for better compression of very repetitive text. |
| private static boolean | ALLOW_PARTIAL_TOKEN_MATCHING - If true, allow partial (prefix) token matching for long (semi-)unique tokens. |
| private static boolean | DONT_TRY_TO_MATCH_PREV_TOKEN - If true, then don't try to match the immediately-previous input token. |
| static Compact7BitString | EMPTY - Value representing an empty String. |
| private static boolean | LOG_STATS - If true, then log some global stats for tuning. |
| private static int | MAX_FULL_HASH_LEN - Maximum number of bytes to compute a full hash over; strictly positive (or -1 to always compute a full hash). |
| private static long | serialVersionUID - Unique serialisation ID. |
| private byte[] | text - Basic (unshared) representation of 1 byte per character; never empty (is null instead). |
| private static int | TOKEN_SEQ_LEN_BITS - Bits that we use to represent a token sequence length; strictly positive and less than 8. |
| private static int | TOKEN_SEQ_LEN_MAX - Maximum number of sequential tokens that we can encode as a sequence; strictly positive. |
| private static int | TOKEN_SEQ_LEN_MIN - Minimum number of sequential tokens that we can encode as a sequence; strictly positive. |
| private static int | TOKEN_SEQ_OFFSET_MIN - Maximally-negative offset that we can hold above the sequence length in one byte; strictly negative. |
| private static boolean | VERBOSE_DEBUG - If true, log our encode/decode steps to help debug problems. |
| Constructor Summary | |
|---|---|
| private | Compact7BitString(byte[] text) - Make simple un-tokenised byte[] representation. |
| Method Summary | |
|---|---|
| static Compact7BitString | convertToCompact7BitString(java.lang.CharSequence s, Compact7BitString.StaticDictionary dict) - Losslessly convert a String entirely consisting of 7-bit ASCII text to a tokenised form for compression/serialisation; not null unless the input is null. |
| boolean | equals(java.lang.Object obj) - Equal iff the static dictionaries (if any) and compressed texts are identical. |
| protected Compact7BitString.StaticDictionary | getDict() - Get the static dictionary, null if none. |
| byte[] | getInternalBytes() - Copy of the raw bytes in the internal representation; null if none. |
| int | hashCode() - Computes the hash code for the text. |
| boolean | isEmpty() - Returns true iff this holds an empty string (""). |
| static Compact7BitString | makeFromInternalBytes(byte[] raw, Compact7BitString.StaticDictionary dict) - Make a Compact7BitString from the raw getInternalBytes() value; never null. |
| private void | readObject(java.io.ObjectInputStream in) - Deserialise. |
| private java.lang.Object | readResolve() - Deserialise: use constructor for validation, defensive copying, etc. |
| private static java.util.List<java.lang.String> | tokenise(java.lang.CharSequence in) - Routine to chop the (non-empty) input CharSequence as an in-order List of String tokens; result may be empty but is never null. |
| java.lang.String | toString() - Convert to full String form. |
| void | validateObject() - Validate fields/state. |
| private void | writeObject(java.io.ObjectOutputStream oos) - Write out a minimally-redundant form of our internal information. |
| protected java.lang.Object | writeReplace() - Returns the best format to write the contained text to the wire. |
| Methods inherited from class java.lang.Object |
|---|
| clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
private static final boolean VERBOSE_DEBUG
private static final boolean LOG_STATS
private transient byte[] text
Written out in a customised low-overhead form.
private static final int MAX_FULL_HASH_LEN
Empirically measured to be fastest at 512 or greater when compact()ing live AEP, as at 20070402.
public static final Compact7BitString EMPTY
private static final int _FUDGE_FACTOR
This value represents the overhead of serialising this instance compared to the special-case String representation, plus some other overheads and potential loss of global compressibility.
private static final long serialVersionUID
private static final boolean DONT_TRY_TO_MATCH_PREV_TOKEN
This also frees up the "-1" offset value as an escape, eg for multi-token-match expansion.
private static final boolean ALLOW_MULTI_TOKEN_MATCHING
private static final boolean ALLOW_1_CHAR_MTM
private static final boolean ALLOW_PARTIAL_TOKEN_MATCHING
private static final int TOKEN_SEQ_LEN_BITS
private static final int TOKEN_SEQ_LEN_MIN
private static final int TOKEN_SEQ_LEN_MAX
private static final int TOKEN_SEQ_OFFSET_MIN
| Constructor Detail |
|---|
private Compact7BitString(byte[] text)
The empty string should be represented with a null array to save space, not an empty array.
| Method Detail |
|---|
public boolean isEmpty()
public java.lang.String toString()
toString in class java.lang.Object
protected Compact7BitString.StaticDictionary getDict()
public byte[] getInternalBytes()
public static Compact7BitString makeFromInternalBytes(byte[] raw,
Compact7BitString.StaticDictionary dict)
The result is EMPTY or is intern()ed so as to avoid duplicates.
dict - the static dictionary; can be null
public boolean equals(java.lang.Object obj)
This is designed to work for specialised sub-classes too.
equals in class java.lang.Object
public int hashCode()
By default this hash includes all bytes of the compressed text in the hash, since that gives best overall performance in hash-based collections.
However, we cap the time spent generating a hash for very long texts, hoping to glean some useful full-text information from the compressed length, and some further useful 'segregation' bits from the dictionary name if any.
Note that we do NOT cache the computed hash code since the aim of this representation is to be as compact as possible and holding an extra word of cached hash would undermine that.
hashCode in class java.lang.Object
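The capping idea described for hashCode() can be sketched as follows. This is an illustration only: the constant value 512 is taken from the MAX_FULL_HASH_LEN note, but the mixing function, the use of a plain dictionary name, and the class name `CappedHash` are assumptions rather than the class's actual algorithm.

```java
/** Hypothetical capped hash; the mixing steps are assumed, not the real algorithm. */
final class CappedHash {
    /** Assumed cap; -1 would mean "always hash every byte". */
    static final int MAX_FULL_HASH_LEN = 512;

    private CappedHash() { }

    /**
     * Hash at most MAX_FULL_HASH_LEN leading bytes, seeding with the total length
     * so that texts differing only in their unhashed tail length still tend to differ.
     *
     * @param text      compressed bytes, or null for the empty value
     * @param dictName  name of the static dictionary, or null if none (hypothetical)
     */
    static int hash(final byte[] text, final String dictName) {
        if (text == null) { return 0; }
        final int limit = (MAX_FULL_HASH_LEN < 0)
            ? text.length
            : Math.min(text.length, MAX_FULL_HASH_LEN);
        int h = text.length;
        for (int i = 0; i < limit; ++i) { h = 31 * h + text[i]; }
        // A few extra 'segregation' bits from the dictionary, if any.
        if (dictName != null) { h ^= dictName.hashCode(); }
        return h;
    }

    public static void main(final String[] args) {
        System.out.println(hash(new byte[] {'a', 'b', 'c'}, null));
    }
}
```

Because nothing is cached, the cost is paid on every call; the cap bounds that cost for very long texts at the price of more collisions among texts that share a long prefix and length.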
private void readObject(java.io.ObjectInputStream in)
throws java.io.IOException,
java.lang.ClassNotFoundException
java.io.IOException
java.lang.ClassNotFoundException
private void writeObject(java.io.ObjectOutputStream oos)
throws java.io.IOException
We don't write *any* default state.
We must never be asked to serialise if we have a (non-null) dictionary.
We must never be asked to serialise if the text is too long for us to encode.
java.io.IOException
private java.lang.Object readResolve()
protected java.lang.Object writeReplace()
throws java.io.ObjectStreamException
We almost always write ourselves out as a String, on the assumption that a good general-purpose compressor such as ZIP will do better than our memory-efficient form when given the original text; but if we manage to find a lot of compressibility then we save ourselves directly.
If a static dictionary is in use then we force serialisation as a String for safety (to avoid ambiguity).
java.io.ObjectStreamException
public void validateObject()
throws java.io.InvalidObjectException
Barf if something bad is found. (Maybe allow some extra info in debug version.)
validateObject in interface java.io.ObjectInputValidation
java.io.InvalidObjectException
private static java.util.List<java.lang.String> tokenise(java.lang.CharSequence in)
The returned list has fast random access, ie with get(index).
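One plausible shape for such a tokeniser is sketched below: a 128-entry flag table marks "word" characters, and the input is cut wherever the word/non-word class changes. The choice of letters and digits as word characters, and the class name `TokeniseSketch`, are assumptions; the real WordChars table and splitting rules are not documented here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokeniser sketch; the word-character set is an assumption. */
final class TokeniseSketch {
    /** Flags for the 128 ASCII code points; true means "word" character. */
    private static final boolean[] WORD = new boolean[128];
    static {
        for (char c = '0'; c <= '9'; ++c) { WORD[c] = true; }
        for (char c = 'A'; c <= 'Z'; ++c) { WORD[c] = true; }
        for (char c = 'a'; c <= 'z'; ++c) { WORD[c] = true; }
    }

    private TokeniseSketch() { }

    static boolean isWord(final char c) { return (c < 128) && WORD[c]; }

    /**
     * Split into maximal runs of word or non-word characters, in input order.
     * Concatenating the tokens reproduces the input exactly.
     * The result is an ArrayList, so get(index) is constant-time.
     */
    static List<String> tokenise(final CharSequence in) {
        final List<String> out = new ArrayList<>();
        final int n = in.length();
        int start = 0;
        for (int i = 1; i <= n; ++i) {
            // Close the current run at end of input or when the class flips.
            if (i == n || isWord(in.charAt(i)) != isWord(in.charAt(start))) {
                out.add(in.subSequence(start, i).toString());
                start = i;
            }
        }
        return out;
    }

    public static void main(final String[] args) {
        // prints [GET,  /, index, ., html]
        System.out.println(tokenise("GET /index.html"));
    }
}
```

Keeping the split lossless matters because the tokens are later matched against earlier tokens or a dictionary and must expand back to the original text.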
public static final Compact7BitString convertToCompact7BitString(java.lang.CharSequence s,
Compact7BitString.StaticDictionary dict)
This is designed to be relatively fast and consume little memory.
Use of a good shared static dictionary between multiple instances may significantly reduce memory consumption, but it is not permitted to serialise instances using such a dictionary other than as a String value.
This does NOT intern() the result since it would not be appropriate in all cases (and could add significant CPU and memory overhead).
s - the 7-bit ASCII text to encode
dict - static compression dictionary to use, or null if none
java.lang.IllegalArgumentException - if the argument is not 7-bit ASCII, ie contains characters outside the range 0-127
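The 7-bit check and the basic one-byte-per-character storage can be illustrated with a small stand-alone sketch. It deliberately omits the tokenising and dictionary steps, and `SevenBitBytes`, `pack` and `unpack` are hypothetical names, not part of this class's API.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing plain 7-bit packing; NOT the real encoder. */
final class SevenBitBytes {
    private SevenBitBytes() { }

    /**
     * Store each character in one byte.
     * Returns null (not an empty array) for empty input, mirroring the
     * documented convention for the text field.
     *
     * @throws IllegalArgumentException if the text is too long or not 7-bit ASCII
     */
    static byte[] pack(final CharSequence s) {
        final int n = s.length();
        if (n == 0) { return null; }
        if (n > Short.MAX_VALUE) {
            throw new IllegalArgumentException("text too long: " + n);
        }
        final byte[] out = new byte[n];
        for (int i = 0; i < n; ++i) {
            final char c = s.charAt(i);
            if (c > 127) {
                throw new IllegalArgumentException("not 7-bit ASCII at index " + i);
            }
            out[i] = (byte) c;
        }
        return out;
    }

    /** Expand back to a String; null is treated as the empty string. */
    static String unpack(final byte[] packed) {
        return (packed == null) ? "" : new String(packed, StandardCharsets.US_ASCII);
    }

    public static void main(final String[] args) {
        final byte[] b = pack("hello");
        System.out.println(b.length + " bytes -> " + unpack(b));
    }
}
```

Even without tokenising, this form uses one byte per character instead of the two that a char-based String typically needs, which is the "basic halving" mentioned in the class description.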
DHD Multimedia Gallery V1.50.55 | ||||||||