2
votes

I am developing a file compressor program. We are currently implementing .ZIP archiver standart, so that when generate a compressed .ZIP archiver any other reputable compressor (such as 7zip) can perfectly understand/uncompres it.

We are now developing the DEFLATE algorithm based on RFC 1951
We have a variant of LZ77 and the Huffman coding with fixed codes working perfectly and compatible with the RFC, thus working with Literal-Length + Distance values.

On the Dynamic Huffman Coding I am currently able to extract the Huffman trees from some compressed data (compressed via another reliable compressor), but when it's time to start decompressing the real data I get incorrect values.
Possibly I'm reading the trees in a wrong way.

I have not specificly found any place where someone explains with exactitude the way the values of these trees are stored on the compressed data.

I assume the encoded data follows the same literal-length values (0~285) + distance (0~30) with its corresponding extra bits per literal / distance as explained in the RFC the same way fixed huffman encoding does.

The way this is stored on fixed huffman encoding is that Huffman Codes are stored with the most significant bit of the code on the least significant bit in memory. This way you are able to navigate down the encoding tree reading bit by bit.
Extra bits of the Huffman code are stored the other way instead .

Does Dynamic Huffman Coding store them the same way?

Is there something I am missing or that I should be aware?

2

2 Answers

9
votes

First off, you don't need to do what you're doing, since it has already been done for you in zlib, a free compression library that permits commercial use. zlib provides implementations of deflate compression and inflate decompression, per RFC 1951. Also you can use minizip, which is included as a third party contribution in the zlib source code package, or libzip, to process zip files using zlib.

If you are intent on doing it yourself, then you can look at puff.c, also in the zlib distribution, which was written with the purpose of supplementing RFC 1951 with an unambiguous definition of the deflate format by virtue of being a heavily commented working deflate decoder.

RFC 1951 does in fact explain the format with exactitude. You just need to read it carefully. puff.c may help accelerate the learning process.

The correct term for the non-fixed Huffman coding is "dynamic". Not "adaptive". The reason is that the term "Adaptive Huffman" refers to something else -- a special technique that is not used in the deflate format -- where the Huffman tree actually morphs as you move through the data. Dynamic Huffman coding, what deflate uses, instead breaks the data up into blocks, and sends a complete Huffman code for each block that is constant throughout that block.

Yes, the dynamic Huffman codes and extra bits are stored in the same order as the fixed Huffman codes. The tricky part is understanding how the Huffman codes are transmitted in the deflate stream header for each block.

1
votes

I think you may be mistaken about the way the extra bits of Huffman codes are stored. RFC1951 says the following about the representation of extra bits:

The extra bits should be interpreted as a machine integer stored with the most-significant bit first, e.g., bits 1110 represent the value 14.

The extra bits are part of the Huffman code and are therefore read in the same order; among other things this means that there will be many ranges of length and distance values to encode that have contiguous Huffman codes (i.e., if code 269 of the length/literal alphabet winds up with the bitstring "1010", then the lengths 19-22 will have the codes 101000, 101001, 101010 and 101011 respectively.)