0
votes

I am trying to implement LZ77 compression algorithm and encountered this problem.

I am compressing the input (could be any binary file, not only text files) on a byte by byte basis, and I use 3 bytes to represent a pointer/reference to previously substring. The first byte of the pointer is always an escape character, b"\xCC", to make things easier, let's say it's C.

The "standard" way I know when working with escape character is that, you encode all other chars normally, and escape the literal which has the same value as escape char. So 'ABCDE' encoded to 'ABCCDE'.

The problem is that, the value of the pointer could be 'CCx', where the second byte could be 'C' and makes the pointer un-distinguishable from escaped literal 'CC', and this causes problems.

How do I fix that? Or what's the correct/standard way to do LZ77? Thanks!

1
Why is that a problem? If the pointer is of fixed length, then you just read that (and ignore escape characters inside of it). If it is not fixed length, then you need to somehow encode the length. If this doesn't help - can you clarify escape scheme with examples that work and that don't? - domen
For example, we are decompressing the content "CCD", the program can't tell if it's literal C, escaped by C, followed by literal D or a pointer with value "CD" - Jeff

1 Answers

2
votes

For LZ77 to be useful, it needs to be followed by an entropy encoder. It is in that step that you encode your symbols to bits that go in the compressed data.

One approach is to have 258 symbols defined, 256 for the literal bytes, one that indicates that a length and distance for a match follows, and one that indicates end of stream.

Or you can do what deflate does, which is encode the lengths and literals together, so that that symbol decodes to either a literal byte or a length, where a length implies that a distance code follows.

Or you can do what brotli does, which is define "insert and copy" codes, which give the number of literals, that is then followed by that many literal codes and then a copy length and distance.

Or you can invent your own.