org.hd.d.pg2k.svrCore
Class Compact7BitString

java.lang.Object
  extended by org.hd.d.pg2k.svrCore.Compact7BitString
All Implemented Interfaces:
java.io.ObjectInputValidation, java.io.Serializable, MemoryTools.Internable
Direct Known Subclasses:
Compact7BitString.WithDict

public class Compact7BitString
extends java.lang.Object
implements java.io.Serializable, java.io.ObjectInputValidation, MemoryTools.Internable

Compact immutable in-memory and on-the-wire representation of a (short) 7-bit ASCII String. Converting to and from this form should be relatively resource-light, eg low CPU and memory requirements.

Where there is a conflict between memory consumption and CPU, this implementation aims to minimise memory use at the possible expense of CPU. Thus this class is probably inefficient to use as a key in a hashed data structure where the hash is used heavily, since we do not cache the hash even though it may be expensive to compute.

This class is instance-controlled, and uses intern() to avoid duplicate in-memory copies.

This cannot hold text longer than Short.MAX_VALUE characters long.

Internally this uses a byte[] to store the data (a basic halving of in-memory size in most JVMs) with some tokenising for further compaction. No attempt is made to share state with other instances in memory (unless a shared StaticDictionary is used).

An instance of Compact7BitString will serialise itself as a String rather than as the usual Compact7BitString if it seems likely to be more efficent/compressable/safe to do so (a) given the special handling of String and (b) for global compressability or (c) if the value cannot be encoded other than as a String instance.

So a coder had better make any serialisable field with a Compact7BitString value be of type Object or Serializable to allow it to be read back as either.

This is not publicly constructuable, and may have specialised sub-classes to support some features, while keeping the base class as small as possible.

See Also:
Serialized Form

Nested Class Summary
static class Compact7BitString.StaticDictionary
          Immutable static dictionary to improve in-memory compression.
private static class Compact7BitString.WithDict
          Specialised immutable sub-class to allow use with a StaticDictionary.
private static class Compact7BitString.WordChars
          Boolean true/false flags for what are considered "word" chars; initialised on first use.
 
Field Summary
private static int _FUDGE_FACTOR
          Margin that byte[] must be smaller than String.length() for this to be smaller on the wire.
private static boolean ALLOW_1_CHAR_MTM
          If true, allow us to do include 1-char tokens in multi-token matches (MTMs).
private static boolean ALLOW_MULTI_TOKEN_MATCHING
          If true then allow multi-token matches (MTM) for better compression of very repetitive text.
private static boolean ALLOW_PARTIAL_TOKEN_MATCHING
          If true, allow partial (prefix) token matching for long (semi-)unique tokens.
private static boolean DONT_TRY_TO_MATCH_PREV_TOKEN
          If true, then don't try to match the immediately-previous input token.
static Compact7BitString EMPTY
          Value representing an empty String.
private static boolean LOG_STATS
          If true, then log some global stats for tuning.
private static int MAX_FULL_HASH_LEN
          Maximum number of bytes to compute a full hash over; strictly positive (or -1 to always compute a full hash).
private static long serialVersionUID
          Unique serialisation ID.
private  byte[] text
          Basic (unshared) representation of 1 byte per character; never empty (is null instead).
private static int TOKEN_SEQ_LEN_BITS
          Bits that we use to represent a token sequence length; strictly positive and less than 8.
private static int TOKEN_SEQ_LEN_MAX
          Maximum number of sequential tokens that we can encode as a sequence; strictly positive.
private static int TOKEN_SEQ_LEN_MIN
          Minimum number of sequential tokens that we can encode as a sequence; strictly positive.
private static int TOKEN_SEQ_OFFSET_MIN
          Maximally-negative offset that we can hold above the sequence length in one byte; strictly negative.
private static boolean VERBOSE_DEBUG
          If true, log our encode/decode steps to help debug problems.
 
Constructor Summary
private Compact7BitString(byte[] text)
          Make simple un-tokenised byte[] representation.
 
Method Summary
static Compact7BitString convertToCompact7BitString(java.lang.CharSequence s, Compact7BitString.StaticDictionary dict)
          Losslessly convert a String entirely consisting of 7-bit ASCII text to a tokenised form for compression/serialisation; not null unless the input is null.
 boolean equals(java.lang.Object obj)
          Equal iff the static dictionaries (if any) and compressed texts are identical.
protected  Compact7BitString.StaticDictionary getDict()
          Get the static dictionary, null if none.
 byte[] getInternalBytes()
          Copy of the raw bytes in the internal representation; null if none.
 int hashCode()
          Computes the hash code for the text.
 boolean isEmpty()
          Returns true iff this holds an empty string ("").
static Compact7BitString makeFromInternalBytes(byte[] raw, Compact7BitString.StaticDictionary dict)
          Make a Compact7BitString from the raw getInternalBytes() value; never null.
private  void readObject(java.io.ObjectInputStream in)
          Deserialise.
private  java.lang.Object readResolve()
          Deserialise: use constructor for validation, defensive copying, etc.
private static java.util.List<java.lang.String> tokenise(java.lang.CharSequence in)
          Routine to chop the (non-empty) input CharSequence as an in-order List of String tokens; result may be empty but is never null.
 java.lang.String toString()
          Convert to full String form.
 void validateObject()
          Validate fields/state.
private  void writeObject(java.io.ObjectOutputStream oos)
          Write out a minimally-redundant form of our internal information.
protected  java.lang.Object writeReplace()
          Returns the best format to write the contained text to the wire.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

VERBOSE_DEBUG

private static final boolean VERBOSE_DEBUG
If true, log our encode/decode steps to help debug problems.

See Also:
Constant Field Values

LOG_STATS

private static final boolean LOG_STATS
If true, then log some global stats for tuning. Should generally be false so as to avoid wasting CPU time and memory.

See Also:
Constant Field Values

text

private transient byte[] text
Basic (unshared) representation of 1 byte per character; never empty (is null instead). Negative values are backward references by the specified number of tokens, eg -2 means "insert a copy of the token two places back".

Written out in a customised low-overhead form.


MAX_FULL_HASH_LEN

private static final int MAX_FULL_HASH_LEN
Maximum number of bytes to compute a full hash over; strictly positive (or -1 to always compute a full hash). This should probably be large enough to allow full hashing of the bulk of typical texts (since a good hash over all content tends to perform best over the whole lifecycle) especially where they are too short for the length itself to be signifiant hash-bit source.

Empirically measured to be fastest at 512 or greater compact()ing live AEP as at 20070402.

See Also:
Constant Field Values

EMPTY

public static final Compact7BitString EMPTY
Value representing an empty String.


_FUDGE_FACTOR

private static final int _FUDGE_FACTOR
Margin that byte[] must be smaller than String.length() for this to be smaller on the wire. If we get this slightly too large then we will write more Strings than necessary but may improve compression of multiple instances a little.

This value represents the overhead of serialising this instance compared to the special-case String representation plus some other overheads and potential loss of global compressability.

See Also:
Constant Field Values

serialVersionUID

private static final long serialVersionUID
Unique serialisation ID.

See Also:
Constant Field Values

DONT_TRY_TO_MATCH_PREV_TOKEN

private static final boolean DONT_TRY_TO_MATCH_PREV_TOKEN
If true, then don't try to match the immediately-previous input token. This is usually not going to be effective as unless we split tokensw we can never match the previous value as we alternate between word and non-word. So not trying to match saves a little bit of time.

This also frees up the "-1" offset value as an escape, eg for multi-token-match expansion.

See Also:
Constant Field Values

ALLOW_MULTI_TOKEN_MATCHING

private static final boolean ALLOW_MULTI_TOKEN_MATCHING
If true then allow multi-token matches (MTM) for better compression of very repetitive text. This is for where a number of tokens in sequence are identical to some previous (near) token sequence.

See Also:
Constant Field Values

ALLOW_1_CHAR_MTM

private static final boolean ALLOW_1_CHAR_MTM
If true, allow us to do include 1-char tokens in multi-token matches (MTMs). This should help maximise the compression from multi-token matches, since many possible multi-token sequences will start with a single-character token, such as " ".

See Also:
Constant Field Values

ALLOW_PARTIAL_TOKEN_MATCHING

private static final boolean ALLOW_PARTIAL_TOKEN_MATCHING
If true, allow partial (prefix) token matching for long (semi-)unique tokens. Can be slow, but allows a little more compression.

See Also:
Constant Field Values

TOKEN_SEQ_LEN_BITS

private static final int TOKEN_SEQ_LEN_BITS
Bits that we use to represent a token sequence length; strictly positive and less than 8.

See Also:
Constant Field Values

TOKEN_SEQ_LEN_MIN

private static final int TOKEN_SEQ_LEN_MIN
Minimum number of sequential tokens that we can encode as a sequence; strictly positive. Since the minimum encoding is usually 2 bytes then we set a minimum sequence length longer than this to guarantee to do better than just a sequence of separate back-references.

See Also:
Constant Field Values

TOKEN_SEQ_LEN_MAX

private static final int TOKEN_SEQ_LEN_MAX
Maximum number of sequential tokens that we can encode as a sequence; strictly positive. Note that we store the sequence length - 2, since we would use the normal back-reference for a single-token match, so we use a zero value to indicate a length-of-two sequence.

See Also:
Constant Field Values

TOKEN_SEQ_OFFSET_MIN

private static final int TOKEN_SEQ_OFFSET_MIN
Maximally-negative offset that we can hold above the sequence length in one byte; strictly negative. We have an implict logical 9th sign bit, always 1 (ie negative).

See Also:
Constant Field Values
Constructor Detail

Compact7BitString

private Compact7BitString(byte[] text)
Make simple un-tokenised byte[] representation. The byte array must be pure 7-bit ASCII and it must be guaranteed unshared so that we need not copy it for safety.

The empty string should be represented with a null array to save space, not an empty array.

Method Detail

isEmpty

public boolean isEmpty()
Returns true iff this holds an empty string (""). Can avoid decoding an entire String value just to discover that it is not empty.


toString

public java.lang.String toString()
Convert to full String form.

Overrides:
toString in class java.lang.Object

getDict

protected Compact7BitString.StaticDictionary getDict()
Get the static dictionary, null if none. Always null in the base class since we do not carry a dictionary to save space.


getInternalBytes

public byte[] getInternalBytes()
Copy of the raw bytes in the internal representation; null if none. The internal representation is subject to change, but, with care, this can be used for extra-compact temporary storage.


makeFromInternalBytes

public static Compact7BitString makeFromInternalBytes(byte[] raw,
                                                      Compact7BitString.StaticDictionary dict)
Make a Compact7BitString from the raw getInternalBytes() value; never null. This is not guaranteed to work if getInternalBytes() is called in one release and serialised in another, though this will not be broken casually.

The result is EMPTY or is intern()ed so as to avoid duplicates.

Parameters:
dict - the static dictionary; can be null

equals

public boolean equals(java.lang.Object obj)
Equal iff the static dictionaries (if any) and compressed texts are identical. In principle identical char/byte sequences could be compressed in different ways, but we only treat as equal items whose internal representations are also identical which makes equalty and hashcode computation relatively fast.

This is designed to work for specialised sub-classes too.

Overrides:
equals in class java.lang.Object

hashCode

public int hashCode()
Computes the hash code for the text. The hash is zero for a zero-length (empty) text string.

By default this hash includes all bytes of the compressed text in the hash, since that gives best overall performance in hash-based collections.

However, we cap the time spent generating a hash for very long texts, hoping to glean some useful full-text information from the compressed length, and some further useful 'segregation' bits from the dictionary name if any.

Note that we do NOT cache the computed hash code since the aim of this representation is to be as compact as possible and holding an extra word of cached hash would undermine that.

Overrides:
hashCode in class java.lang.Object

readObject

private void readObject(java.io.ObjectInputStream in)
                 throws java.io.IOException,
                        java.lang.ClassNotFoundException
Deserialise.

Throws:
java.io.IOException
java.lang.ClassNotFoundException

writeObject

private void writeObject(java.io.ObjectOutputStream oos)
                  throws java.io.IOException
Write out a minimally-redundant form of our internal information. The more-efficient on-the-wire format also makes defensive reading easier.

We don't write *any* default state.

We must never be asked to serialise if we have a (non-null) dictionary.

We must never be asked to serialise if the text is too long for us to encode.

Throws:
java.io.IOException

readResolve

private java.lang.Object readResolve()
Deserialise: use constructor for validation, defensive copying, etc.


writeReplace

protected java.lang.Object writeReplace()
                                 throws java.io.ObjectStreamException
Returns the best format to write the contained text to the wire. We write ourself to the wire as a java.lang.String if it appears that that would be more efficient and/or safer.

We almost always write ourself out as a String assuming that a good general-purpose compressor such as ZIP will do better than our memory-efficient form given the original text, but if we manage to find a lot of compressability then we'll save ourself directly.

If a static dictionary is in use then we force serialisation as a String for safety (to avoid ambiguity).

Throws:
java.io.ObjectStreamException

validateObject

public void validateObject()
                    throws java.io.InvalidObjectException
Validate fields/state. Called in the constructor and possibly after de-serialising.

Barf if something bad is found. (Maybe allow some extra info in debug version.)

Specified by:
validateObject in interface java.io.ObjectInputValidation
Throws:
java.io.InvalidObjectException

tokenise

private static java.util.List<java.lang.String> tokenise(java.lang.CharSequence in)
Routine to chop the (non-empty) input CharSequence as an in-order List of String tokens; result may be empty but is never null. The toString() method may be used to convert all if the input (or subsequences) to a String, so must behave well.

The returned list has fast random access, ie with get(index).


convertToCompact7BitString

public static final Compact7BitString convertToCompact7BitString(java.lang.CharSequence s,
                                                                 Compact7BitString.StaticDictionary dict)
Losslessly convert a String entirely consisting of 7-bit ASCII text to a tokenised form for compression/serialisation; not null unless the input is null. This should be especially effective for XML, but also OK for general text.

This is designed to be relatively fast and consume little memory.

Use of a good shared static dictionary between multiple instances may significantly reduce memory consumption, but it is not permitted to serialise instances using such a dictionary other than as a String value.

This does NOT intern() the result since it would not be appropriate in all cases (and could add significant CPU and memory overhead).

Parameters:
s - the 7-bit ASCII text to encode
dict - static compression dictionary to use, or null if none
Throws:
java.lang.IllegalArgumentException - if the argument is not 7-bit ASCII ie contains characters outside the range 0--127

DHD Multimedia Gallery V1.50.55

Copyright (c) 1996-2008, Damon Hart-Davis. All rights reserved.