3
votes

my hope for this question (see: bottom) is to lay out as much as I know about the deflate process, and I can receive corrections on areas where I am (perhaps very) misinformed. Hopefully, at the end of it, this question could be a handy resource.

Zlib header

The first two bytes equate to a header for the zlib compression used with the format of (credit)

---CMF---  ---FLG---
0111.1000  1101.0101
CINF -CM-  +-||
           | |+- FCHECK
           | +-- FDICT
           +---- FLEVEL

From RFC 1950, right to left:

  1. FCHECK (1.0101) - validates that CMF & FLG as a 16 bit unsigned int is a multiple of 31

  2. FDICT (0) - if set, indicates preset DICT following immediately after FLG

  3. FLEVEL (11) - compression "intensity" [0-3]

  4. CM (1000) - for compression method, where CM = 8 == "deflate" compression method

  5. CINF (0111) - indicates the size of the sliding window used, where CINF = 7 == 32K sliding window

Data block header

The next three bits in the NEW BYTE equate to a header for the Huffman encoded block:

---CMF---  ---FLG---  NEW BYTE
0111.1000  1101.0101  11101100
                           |-|
                           | +- BFINAL
                           +--- BTYPE

From RFC 1951 right to left:

  1. BFINAL (0) - is set (1) if this is the last block of data

  2. BTYPE (10) - Huffman encoding : (00)none; (01)Fixed Huffman codes; (10) dynamic codes; (11) invalid

The Huffman Codes

From here I will work off the assumption of BTYPE = (10)

The following values immediately proceed:

NEW BYTE                NXT BYTE                  
(11101)100       ->     101)(11101)   ->   0111111(1
       |-|
       | +- BFINAL
       +--- BTYPE
  1. HLIT (11101) - 5-bit number of length/literal codes, 257 added (257-286)

  2. HDIST (11101) - 5-bit number of distance codes, 1 added (1-32)

  3. HCLEN (1111) - 4-bit number of code-length codes, 4 added (4-19)

Immediately following this is HCLEN(don't forget +4) 3-bit fields, where the values are assigned to this sequence, in order:

16 17 18 0 8 7 9 6 10 5 11 4 12 3 13 2 14 1 15

Since HCLEN = 19, the whole sequence is used

A code length of 0 in that sequence means the corresponding symbol is not used.

As a graphical example, after reading the 19x3 bits we have six extra bits(extra bits in brackets):

NXT BYTE 00000000 00000000 00000000 00000000 00000000 00000000 [000000](00

My Question

Do the last bits in brackets above, get thrown away?

1
I wouldn't downvote, but I have my doubts about the value of this question. 1) it's not about PNG, it's about ZLIB/deflate spec (that PNG uses ZLIB/deflate is irrelevant here - the concept of "decoupling layers" is essential to engineering). 2) It's not a real concrete question, it's rather an appeal to help to understand the details of ZLIB/deflate format using an special example (this will hardly help any other people) 3) it's slightly strange (though perhaps also commendable) to pretend to understand a ZLIB stream at the bit level, specially if you don't know about Huffman compressionleonbloy
... and it smells funny in the context of the question title: you normally don't need to understand things at such a low level to understand the PNG format. 4) see here stackoverflow.com/questions/26018811/…leonbloy
I suppose my definition of understanding something in this domain is being able to code it. As in program a decoder. But I would argue the value of the question, it's a great exercise to truly master file I/O with bytes, even down to individual bits. It also involves endian-ness. All areas I am weak in, as a college educated CS student. Maybe the question title was incorrect, can that be changed? Should it be, if it can be? And my (possibly poor) logic for including PNG is that a large amount of searches for deflate would naturally involve PNG. I believe this is awesome stuff to learn from.Trés DuBiel
@leonbloy Also, one of stack overflow's best features, in my opinion, are the truly in-depth answers that dispel any confusion and lay out an optimal way to do X. I was trying to emulate or provoke such an answer from the communityTrés DuBiel
@MarcusJ I initially did all of this because I was studying PNG files. The IDAT chunk is structured exactly how I have it laid out, above. Starting with a zlib header, then the data block header, followed by the huffman code(s), then the encoded data. Your first byte of 0x78 is identical to my example above (CMF byte containing CINF and CM)Trés DuBiel

1 Answers

6
votes

No. The only times in a deflate stream that you skip bits to go to a byte boundary are for a stored block (00), or when the end code in the last block is read. After the bits for the code length code lengths, you continue with the subsequent bits to use the generated Huffman code to read the code lengths.