3
votes

I am interested in learning about the deflate compression algorithm, particularly how is it represented in a data-stream, and feel that I would greatly benefit from some extra examples (eg. the compression of a short string of text, or the decompression of a compressed chunk). I am continuing to study some resources I have found: ref1, ref2, ref3 but these do not have many examples of how the actual compression looks as a data-stream.

If I could get a few examples of how some strings would look before and after being compressed, and an explanation of the relationship between them that would be fantastic.

Also if there are other resources that I could be looking at please add those.

2

2 Answers

2
votes

You can compress example data with gzip or zlib and use infgen to disassemble and examine the resulting compressed data. infgen also has an option to see the detail in the dynamic headers.

1
votes

+1 for infgen, but here's a slightly more detailed answer.

You can take a look at the before- and after- using gzip and any hex editor. For example, xxd is included on most linux distros. I'd included both raw hex output (not that interesting without understanding) and infgen's output.

  • hello hello hello hello (triggers static huffman coding, like most short strings).
~ $ echo -n "hello hello hello hello" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 cb48 cdc9 c957  ...........H...W
00000010: c840 2701 e351 3d8d 1700 0000            .@'..Q=.....

~ $ echo -n "hello hello hello hello" | gzip | ./infgen/a.out -i
! infgen 2.4 output
!
gzip
!
last
fixed
literal 'hello h
match 16 6
end
!
crc
length
  • \xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1 (triggers uncompressed mode)
~ $ echo -ne "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 010f 00f0 ffff  ................
00000010: fefd fcfb faf9 f8f7 f6f5 f4f3 f2f1 c6d3  ................
00000020: 157e 0f00 0000                           .~....

~ $ echo -ne "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" | gzip | ./infgen/a.out -i
! infgen 2.4 output
!
gzip
!
last
stored
data 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241
end
!
crc
length
  • abaabbbabaababbaababaaaabaaabbbbbaa (triggers dynamic huffman coding)
~ $ echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 1dc6 4901 0000  ............I...
00000010: 1040 c0ac a37f 883d 3c20 2a97 9d37 5e1d  .@.....=< *..7^.
00000020: 0c6e 2934 9423 0000 00                   .n)4.#...

~ $ echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip | ./infgen/a.out -i -d
! infgen 2.4 output
!
gzip
!
last
dynamic
count 260 7 18
code 1 4
code 2 1
code 4 4
code 16 4
code 17 4
code 18 2
zeros 97
lens 1 2
zeros 138
zeros 19
lens 4
repeat 3
lens 2
zeros 3
lens 2 2 2
! litlen 97 1
! litlen 98 2
! litlen 256 4
! litlen 257 4
! litlen 258 4
! litlen 259 4
! dist 0 2
! dist 4 2
! dist 5 2
! dist 6 2
literal 'abaabbba
match 4 7
match 3 9
match 5 6
literal 'aaa
match 5 5
literal 'b
match 4 1
literal 'aa
end
!
crc
length

I found infgen was still not enough detail to fully understand the format. I look through decompressing all three examples here bit-by-bit, by hand, in detail on my blog

For concepts, in addition to RFC 1951 (DEFLATE) which is pretty good, I would recommend Feldspar's conceptual overview of Huffman codes and LZ77 in DEFLATE