Info-ZIP DEFLATE fixed Huffman inefficiencies

Question

As an example, consider a file consisting of 7 newline characters (bytes [0x0A, 0x0A, 0x0A, 0x0A, 0x0A, 0x0A, 0x0A]). Using the Info-ZIP implementation of zip (this ships with OSX and other operating systems) to create a zip file, the file is stored with fixed huffman mode (mode 01). The bytes in the file data block are:

0xE3 0xE2 0x02 0x03 0x00

Based on the bits, there is one block (the first header has the last block bit set to true).

The first code in the block is the literal 0x0A (the next 7 bits encode 0x5C and the following is 0). The second code in the block is the literal 0x0A, and the third code in the block indicates a copy of 5 bytes with distance 1.

If I understand the operations properly, the first literal is not copied. Going byte by byte, using * to mark the byte being copied to the end, the operations are:

[0x0A *0x0A]                          ->  [0x0A  0x0A  0x0A]
[0x0A  0x0A *0x0A]                    ->  [0x0A  0x0A  0x0A *0x0A]
[0x0A  0x0A  0x0A *0x0A]              ->  [0x0A  0x0A  0x0A  0x0A *0x0A]
[0x0A  0x0A  0x0A  0x0A *0x0A]        ->  [0x0A  0x0A  0x0A  0x0A  0x0A *0x0A]
[0x0A  0x0A  0x0A  0x0A  0x0A *0x0A]  ->  [0x0A  0x0A  0x0A  0x0A  0x0A  0x0A *0x0A]

Question: It should be clear from the diagram that the first byte is not copied. The most efficient encoding would only need two codes: literal 0x0A and copy of 6 bytes with distance 1. Is there a reason (e.g. bug in another implementation) why the more efficient encoding should not be used?

Mark Adler Mark Adler · Accepted Answer · 2014-12-11T03:41:47

From deflate.c:

/* Stop when cur_match becomes <= limit. To simplify the code,
 * we prevent matches with the string of window index 0.
 */

Info-ZIP DEFLATE fixed Huffman inefficiencies

1 Answers