lzip: Quality assurance

 
 4 Design, development, and testing of lzip
 ******************************************
 
 There are two ways of constructing a software design: One way is to make it
 so simple that there are obviously no deficiencies and the other way is to
 make it so complicated that there are no obvious deficiencies. The first
 method is far more difficult.
 -- C.A.R. Hoare
 
    Lzip is developed by volunteers who lack the resources required for
 extensive testing in all circumstances. It is up to you to test lzip before
 using it in mission-critical applications. However, a compressor like lzip
 is not a toy, and maintaining it is not a hobby. Many people's data depend
 on it. Therefore the lzip file format has been reviewed carefully and is
 believed to be free from negligent design errors.
 
    Lzip has been designed, written, and tested with great care to replace
 gzip and bzip2 as the standard general-purpose compressed format for
 unix-like systems. This chapter describes the lessons learned from these
 previous formats, and their application to the design of lzip.
 
 
 4.1 Format design
 =================
 
 When gzip was designed in 1992, computers and operating systems were much
 less capable than they are today. The designers of gzip tried to work around
 some of those limitations, like 8.3 file names, with additional fields in
 the file format.
 
    Today those limitations have mostly disappeared, and the format of gzip
 has proved to be unnecessarily complicated. It includes fields that were
 never used, others that have lost their usefulness, and finally others that
 have become too limited.
 
    Bzip2 was designed 5 years later, and its format is simpler than the one
 of gzip.
 
    Probably the worst defect of the gzip format from the point of view of
 data safety is the variable size of its header. If the byte at offset 3
 (flags) of a gzip member gets corrupted, it may become difficult to recover
 the data, even if the compressed blocks are intact, because it can't be
 known with certainty where the compressed blocks begin.
 
    By contrast, the header of a lzip member has a fixed length of 6. The
 LZMA stream in a lzip member always starts at offset 6, making it trivial to
 recover the data even if the whole header becomes corrupt.
 
    Bzip2 also provides a header of fixed length and marks the begin and end
 of each compressed block with six magic bytes, making it possible to find
 the compressed blocks even in case of file damage. But bzip2 does not store
 the size of each compressed block, as lzip does.
 
    Lziprecover is able to provide unique data recovery capabilities because
 the lzip format is extraordinarily safe. The simple and safe design of the
 file format complements the embedded error detection provided by the LZMA
 data stream. Any distance larger than the dictionary size acts as a
 forbidden symbol, allowing the decompressor to detect the approximate
 position of errors, and leaving very little work for the check sequence
 (CRC and data sizes) in the detection of errors. Lzip is usually able to
 detect all possible bit flips in the compressed data without resorting to
 the check sequence. It would be difficult to write an automatic recovery
 tool like lziprecover for the gzip format. And, as far as I know, it has
 never been written.
 
    Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
 decompressed data because it provides optimal accuracy in the detection of
 errors up to a compressed size of about 16 GiB, a size larger than that of
 most files. In the case of lzip, the additional detection capability of the
 decompressor reduces the probability of undetected errors several million
 times more, resulting in a combined integrity checking optimally accurate
 for any member size produced by lzip. Preliminary results suggest that the
 lzip format is safe enough to be used in critical safety avionics systems.
 
    The lzip format is designed for long-term archiving. Therefore it
 excludes any unneeded features that may interfere with the future
 extraction of the decompressed data.
 
 
 4.1.1 Gzip format (mis)features not present in lzip
 ---------------------------------------------------
 
 'Multiple algorithms'
      Gzip provides a CM (Compression Method) field that has never been used
      because it is a bad idea to begin with. New compression methods may
      require additional fields, making it impossible to implement new
      methods and, at the same time, keep the same format. This field does
      not solve the problem of format proliferation; it just makes the
      problem less obvious.
 
 'Optional fields in header'
      Unless special precautions are taken, optional fields are generally a
      bad idea because they produce a header of variable size. The gzip
      header has 2 fields that, in addition to being optional, are
      zero-terminated. This means that if any byte inside the field gets
      zeroed, or if the terminating zero gets altered, gzip won't be able to
      find neither the header CRC nor the compressed blocks.
 
 'Optional CRC for the header'
      Using an optional CRC for the header is not only a bad idea, it is an
      error; it circumvents the Hamming distance (HD) of the CRC and may
      prevent the extraction of perfectly good data. For example, if the CRC
      is used and the bit enabling it is reset by a bit flip, the header
      will appear to be intact (in spite of being corrupt) while the
      compressed blocks will appear to be totally unrecoverable (in spite of
      being intact). Very misleading indeed.
 
 'Metadata'
      The gzip format stores some metadata, like the modification time of the
      original file or the operating system on which compression took place.
      This complicates reproducible compression (obtaining identical
      compressed output from identical input).
 
 
 4.1.2 Lzip format improvements over gzip and bzip2
 --------------------------------------------------
 
 '64-bit size field'
      Probably the most frequently reported shortcoming of the gzip format
      is that it only stores the least significant 32 bits of the
      uncompressed size. The size of any file larger than 4 GiB gets
      truncated.
 
      Bzip2 does not store the uncompressed size of the file.
 
      The lzip format provides a 64-bit field for the uncompressed size.
      Additionally, lzip produces multimember output automatically when the
      size is too large for a single member, allowing for an unlimited
      uncompressed size.
 
 'Distributed index'
      The lzip format provides a distributed index that, among other things,
      helps plzip to decompress several times faster than pigz and helps
      lziprecover do its job. Neither the gzip format nor the bzip2 format
      do provide an index.
 
      A distributed index is safer and more scalable than a monolithic
      index. The monolithic index introduces a single point of failure in
      the compressed file and may limit the number of members or the total
      uncompressed size.
 
 
 4.2 Quality of implementation
 =============================
 
 'Accurate and robust error detection'
      The lzip format provides 3 factor integrity checking, and the
      decompressors report mismatches in each factor separately. This method
      detects most false positives for corruption. If just one byte in one
      factor fails but the other two factors match the data, it probably
      means that the data are intact and the corruption just affects the
      mismatching factor (CRC, data size, or member size) in the member
      trailer.
 
 'Multiple implementations'
      Just like the lzip format provides 3 factor protection against
      undetected data corruption, the development methodology of the lzip
      family of compressors provides 3 factor protection against undetected
      programming errors.
 
      Three related but independent compressor implementations, lzip, clzip,
      and minilzip/lzlib, are developed concurrently. Every stable release
      of any of them is tested to verify that it produces identical output
      to the other two. This guarantees that all three implement the same
      algorithm, and makes it unlikely that any of them may contain serious
      undiscovered errors. In fact, no errors have been discovered in lzip
      since 2009.
 
      Additionally, the three implementations have been extensively tested
      with unzcrash, valgrind, and 'american fuzzy lop' without finding a
      single vulnerability or false negative. ⇒Unzcrash
      (lziprecover)Unzcrash.
 
 'Dictionary size'
      Lzip automatically adapts the dictionary size to the size of each file.
      In addition to reducing the amount of memory required for
      decompression, this feature also minimizes the probability of being
      affected by RAM errors during compression.
 
 'Exit status'
      Returning a warning status of 2 is a design flaw of compress that
      leaked into the design of gzip. Both bzip2 and lzip are free from this
      flaw.