2
votes

I'm trying to find a correct implementation of LZ77, the original famous algorithm in the 1977 paper. What I have found is a number of different implementations that produce varying output, but still labelled LZ77. Some use hash tables for example, something used in the more 'official' algorithms like LZRW or LZJB. So I am confused.

Some implementations I have tested:

  1. https://gist.github.com/fogus/5401265 (C, 742 > 538 bytes, hashtable? jumbled output)
  2. https://sourceforge.net/projects/crush (C++, 742 > 508 bytes, hashtable? jumbled output)
  3. https://github.com/cstdvd/lz77 (C, 742 > 642 bytes -- contains readable ASCII in output)
  4. http://lab.polygonpla.net/js/tinylz77.html (JS, 742 > 863 bytes!! -- contains readable ASCII in output)
  5. http://geocities.com/diogok_br/lz77/demo.html (JS, 742 > 658 bytes -- contains readable ASCII in output)
  6. github.com/olle/lz77-kit/src/main/js/lz77.js (JS, 742 > 639 bytes -- contains readable ASCII in output)
  7. https://github.com/Favrito/LZ77 (C, 742 > 755 bytes!!)

As far as I can tell, none use any post-processing coding such as Huffman, etc.

Text I used to compress:

Oho! Oho! Rise up, O Teti!
Take your head, collect your bones,
Gather your limbs, shake the earth from your flesh!
Take your bread that rots not, your beer that sours not,
Stand at the gates that bar the common people!
The gatekeeper comes out to you, he grasps your hand,
Takes you into heaven, to your father Geb.
He rejoices at your coming, gives you his hands,
Kisses you, caresses you,
Sets you before the spirits, the imperishable stars...
The hidden ones worship you,
The great ones surround you,
The watchers wait on you,
Barley is threshed for you,
Emmer is reaped for you,
Your monthly feasts are made with it,
Your half-month feasts are made with it,
As ordered done for you by Geb, your father,
Rise up, O Teti, you shall not die!

All of them have a different output stream. Is there no pure, reference implementation or standard of LZ77 to check against?

Why don't all "LZ77" compressors give the same compression ratio, the same output bitstream?

1

1 Answers

6
votes

There isn't one specific way to implement LZ77

LZ77 only provides the general mathematical concepts of the algorithm itself. It is flexible in that its parameters can be altered, resulting in different requirements for the encoder and decoder, and can drastically affect the resulting data stream. It is up the implementation to decide those details, such as the size of the buffers and how the codewords are constructed. The sensitivity of these parameters is why competing implementations may call themselves LZ77 but be incompatible.

For example, the DEFLATE specification specifies a 32768 window size, and stores the position and length as a 15+8 bit codeword. A simpler but less efficient implementation may choose 12-bit distance and 4-bit length, giving a 4096-byte window size. Another may choose a 8192-byte window size, using 13-bits to express the distance, leaving only 3-bits for the length if one is to use 16-bits per token.

This freedom leads to innovations in other ways, such as LZSS introducing literal flags, or LZRW using hash tables. Another popular innovation is following up LZ-based compression with Huffman coding (as in DEFLATE) or another entropy encoder to improve compression ratios.