lzip: Algorithm

 
 5 Algorithm
 ***********
 
 In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
 concrete algorithm; it is more like "any algorithm using the LZMA coding
 scheme". LZMA compression consists in describing the uncompressed data as a
 succession of coding sequences from the set shown in Section 'What is
 coded' (⇒what-is-coded), and then encoding them using a range
 encoder. For example, the option '-0' of lzip uses the scheme in almost the
 simplest way possible; issuing the longest match it can find, or a literal
 byte if it can't find a match. Inversely, a much more elaborated way of
 finding coding sequences of minimum size than the one currently used by
 lzip could be developed, and the resulting sequence could also be coded
 using the LZMA coding scheme.
 
    Lzip currently implements two variants of the LZMA algorithm: fast (used
 by option '-0') and normal (used by all other compression levels).
 
    The high compression of LZMA comes from combining two basic, well-proven
 compression ideas: sliding dictionaries (LZ77/78) and markov models (the
 thing used by every compression algorithm that uses a range encoder or
 similar order-0 entropy coder as its last stage) with segregation of
 contexts according to what the bits are used for.
 
    Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder,
 which reduces redundancy by translating chunks of data to their
 corresponding distance-length pairs. The second stage is a range encoder
 that uses a different probability model for each type of data: distances,
 lengths, literal bytes, etc.
 
    Here is how it works, step by step:
 
    1) The member header is written to the output stream.
 
    2) The first byte is coded literally, because there are no previous
 bytes to which the match finder can refer to.
 
    3) The main encoder advances to the next byte in the input data and
 calls the match finder.
 
    4) The match finder fills an array with the minimum distances before the
 current byte where a match of a given length can be found.
 
    5) Go back to step 3 until a sequence (formed of pairs, repeated
 distances, and literal bytes) of minimum price has been formed. Where the
 price represents the number of output bits produced.
 
    6) The range encoder encodes the sequence produced by the main encoder
 and sends the bytes produced to the output stream.
 
    7) Go back to step 3 until the input data are finished or until the
 member or volume size limits are reached.
 
    8) The range encoder is flushed.
 
    9) The member trailer is written to the output stream.
 
    10) If there are more data to compress, go back to step 1.
 
 
    During compression, lzip reads data in large blocks (one dictionary size
 at a time). Therefore it may block for up to tens of seconds any process
 feeding data to it through a pipe. This is normal. The blocking intervals
 get longer with higher compression levels because dictionary size increases
 (and compression speed decreases) with compression level.
 
 The ideas embodied in lzip are due to (at least) the following people:
 Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the
 definition of Markov chains), G.N.N. Martin (for the definition of range
 encoding), Igor Pavlov (for putting all the above together in LZMA), and
 Julian Seward (for bzip2's CLI).