lzip: Stream format

 
 7 Format of the LZMA stream in lzip files
 *****************************************
 
 The LZMA algorithm has three parameters, called "special LZMA properties",
 to adjust it for some kinds of binary data. These parameters are:
 'literal_context_bits' (with a default value of 3),
 'literal_pos_state_bits' (with a default value of 0), and 'pos_state_bits'
 (with a default value of 2). As a general purpose compressor, lzip only
 uses the default values for these parameters. In particular
 'literal_pos_state_bits' has been optimized away and does not even appear
 in the code.
 
    Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker (the
 distance-length pair 0xFFFFFFFFU, 2), which in conjunction with the 'member
 size' field in the member trailer allows the verification of stream
 integrity. The EOS marker is the only marker allowed in lzip files. The
 LZMA stream in lzip files always has these two features (default properties
 and EOS marker) and is referred to in this document as LZMA-302eos. This
 simplified form of the LZMA stream format has been chosen to maximize
 interoperability and safety.
 
    The second stage of LZMA is a range encoder that uses a different
 probability model for each type of symbol: distances, lengths, literal
 bytes, etc. Range encoding conceptually encodes all the symbols of the
 message into one number. Unlike Huffman coding, which assigns to each
 symbol a bit-pattern and concatenates all the bit-patterns together, range
 encoding can compress one symbol to less than one bit. Therefore the
 compressed data produced by a range encoder can't be split in pieces that
 could be described individually.
 
    It seems that the only way of describing the LZMA-302eos stream is to
 describe the algorithm that decodes it. And given the many details about
 the range decoder that need to be described accurately, the source code of
 a real decompressor seems the only appropriate reference to use.
 
    What follows is a description of the decoding algorithm for LZMA-302eos
 streams using as reference the source code of "lzd", an educational
 decompressor for lzip files which can be downloaded from the lzip download
 directory. Lzd is written in C++11 and its source code is included in
 appendix A. ⇒Reference source code.
 
 
 7.1 What is coded
 =================
 
 The LZMA stream includes literals, matches, and repeated matches (matches
 reusing a recently used distance). There are 7 different coding sequences:
 
 Bit sequence                Name        Description
 -----------------------------------------------------------------------------
 0 + byte                    literal     literal byte
 1 + 0 + len + dis           match       distance-length pair
 1 + 1 + 0 + 0               shortrep    1 byte match at latest used distance
 1 + 1 + 0 + 1 + len         rep0        len bytes match at latest used distance
 1 + 1 + 1 + 0 + len         rep1        len bytes match at second latest used
                                         distance
 1 + 1 + 1 + 1 + 0 + len     rep2        len bytes match at third latest used
                                         distance
 1 + 1 + 1 + 1 + 1 + len     rep3        len bytes match at fourth latest used
                                         distance
 
 
    In the following tables, multibit sequences are coded in normal order,
 from most significant bit (MSB) to least significant bit (LSB), except
 where noted otherwise.
 
    Lengths (the 'len' in the table above) are coded as follows:
 
 Bit sequence                           Description
 ----------------------------------------------------------------------------
 0 + 3 bits                             lengths from 2 to 9
 1 + 0 + 3 bits                         lengths from 10 to 17
 1 + 1 + 8 bits                         lengths from 18 to 273
 
 
    The coding of distances is a little more complicated, so I'll begin by
 explaining a simpler version of the encoding.
 
    Imagine you need to encode a number from 0 to 2^32 - 1, and you want to
 do it in a way that produces shorter codes for the smaller numbers. You may
 first encode the position of the most significant bit that is set to 1,
 which you may find by making a bit scan from the left (from the MSB). A
 position of 0 means that the number is 0 (no bit is set), 1 means the LSB is
 the first bit set (the number is 1), and 32 means the MSB is set (i.e., the
 number is >= 0x80000000). Then, if the position is >= 2, you encode the
 remaining position - 1 bits. Let's call these bits "direct bits" because
 they are coded directly by value instead of indirectly by position.
 
    The inconvenient of this simple method is that it needs 6 bits to encode
 the position, but it just uses 33 of the 64 possible values, wasting almost
 half of the codes.
 
    The intelligent trick of LZMA is that it encodes in what it calls a
 "slot" the position of the most significant bit set, along with the value
 of the next bit, using the same 6 bits that would take to encode the
 position alone. This seems to need 66 slots (twice the number of
 positions), but for positions 0 and 1 there is no next bit, so the number
 of slots needed is 64 (0 to 63).
 
    The 6 bits representing this "slot number" are then context-coded. If
 the distance is >= 4, the remaining bits are encoded as follows.
 'direct_bits' is the amount of remaining bits (from 1 to 30) needed to form
 a complete distance, and is calculated as (slot >> 1) - 1. If a distance
 needs 6 or more direct_bits, the last 4 bits are encoded separately. The
 last piece (all the direct_bits for distances 4 to 127, or the last 4 bits
 for distances >= 128) is context-coded in reverse order (from LSB to MSB).
 For distances >= 128, the 'direct_bits - 4' part is encoded with fixed 0.5
 probability.
 
 Bit sequence                           Description
 ----------------------------------------------------------------------------
 slot                                   distances from 0 to 3
 slot + direct_bits                     distances from 4 to 127
 slot + (direct_bits - 4) + 4 bits      distances from 128 to 2^32 - 1
 
 
 7.2 The coding contexts
 =======================
 
 These contexts ('Bit_model' in the source), are integers or arrays of
 integers representing the probability of the corresponding bit being 0.
 
    The indices used in these arrays are:
 
 'state'
      A state machine ('State' in the source) with 12 states (0 to 11),
      coding the latest 2 to 4 types of sequences processed. The initial
      state is 0.
 
 'pos_state'
      Value of the 2 least significant bits of the current position in the
      decoded data.
 
 'literal_state'
      Value of the 3 most significant bits of the latest byte decoded.
 
 'len_state'
      Coded value of the current match length (length - 2), with a maximum
      of 3. The resulting value is in the range 0 to 3.
 
 
    The types of previous sequences corresponding to each state are shown in
 the following table. '!literal' is any sequence except a literal byte.
 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The last type in
 each line is the most recent.
 
 State   Types of previous sequences
 ------------------------------------------------------
 0       literal, literal, literal
 1       match, literal, literal
 2       rep or (!literal, shortrep), literal, literal
 3       literal, shortrep, literal, literal
 4       match, literal
 5       rep or (!literal, shortrep), literal
 6       literal, shortrep, literal
 7       literal, match
 8       literal, rep
 9       literal, shortrep
 10      !literal, match
 11      !literal, (rep or shortrep)
 
 
    The contexts for decoding the type of coding sequence are:
 
 Name            Indices                     Used when
 ----------------------------------------------------------------------------
 bm_match        state, pos_state            sequence start
 bm_rep          state                       after sequence 1
 bm_rep0         state                       after sequence 11
 bm_rep1         state                       after sequence 111
 bm_rep2         state                       after sequence 1111
 bm_len          state, pos_state            after sequence 110
 
 
    The contexts for decoding distances are:
 
 Name            Indices                 Used when
 ----------------------------------------------------------------------------
 bm_dis_slot     len_state, bit tree     distance start
 bm_dis          reverse bit tree        after slots 4 to 13
 bm_align        reverse bit tree        for distances >= 128, after fixed
                                         probability bits
 
 
    There are two separate sets of contexts for lengths ('Len_model' in the
 source). One for normal matches, the other for repeated matches. The
 contexts in each Len_model are (see 'decode_len' in the source):
 
 Name            Indices                        Used when
 ---------------------------------------------------------------------------
 choice1         none                           length start
 choice2         none                           after sequence 1
 bm_low          pos_state, bit tree            after sequence 0
 bm_mid          pos_state, bit tree            after sequence 10
 bm_high         bit tree                       after sequence 11
 
 
    The context array 'bm_literal' is special. In principle it acts as a
 normal bit tree context, the one selected by 'literal_state'. But if the
 previous decoded byte was not a literal, two other bit tree contexts are
 used depending on the value of each bit in 'match_byte' (the byte at the
 latest used distance), until a bit is decoded that is different from its
 corresponding bit in 'match_byte'. After the first difference is found, the
 rest of the byte is decoded using the normal bit tree context. (See
 'decode_matched' in the source).
 
 
 7.3 The range decoder
 =====================
 
 The LZMA stream is consumed one byte at a time by the range decoder. (See
 'normalize' in the source). Every byte consumed produces a variable number
 of decoded bits, depending on how well these bits agree with their context.
 (See 'decode_bit' in the source).
 
    The range decoder state consists of two unsigned 32-bit variables:
 'range' (representing the most significant part of the range size not yet
 decoded) and 'code' (representing the current point within 'range').
 'range' is initialized to 2^32 - 1, and 'code' is initialized to 0.
 
    The range encoder produces a first 0 byte that must be ignored by the
 range decoder. This is done by shifting 5 bytes in the initialization of
 'code' instead of 4. (See the 'Range_decoder' constructor in the source).
 
 
 7.4 Decoding and verifying the LZMA stream
 ==========================================
 
 After decoding the member header and obtaining the dictionary size, the
 range decoder is initialized and then the LZMA decoder enters a loop (see
 'decode_member' in the source) where it invokes the range decoder with the
 appropriate contexts to decode the different coding sequences (matches,
 repeated matches, and literal bytes), until the "End Of Stream" marker is
 decoded.
 
    Once the "End Of Stream" marker has been decoded, the decompressor reads
 and decodes the member trailer, and verifies that the three integrity
 factors stored there (CRC, data size, and member size) match those computed
 from the data.