LZ4 compression algorithm explanation

Question

Description from Wikipedia:

The LZ4 algorithm represents the data as a series of sequences. Each sequence begins with a one byte token that is broken into two 4 bit fields. The first field represents the number of literal bytes that are to be copied to the output. The second field represents the number of bytes to copy from the already decoded output buffer (with 0 representing the minimum match length of 4 bytes). A value of 15 in either of the bitfields indicates that the length is larger and there is an extra byte of data that is to be added to the length. A value of 255 in these extra bytes indicates that yet another byte to be added. Hence arbitrary lengths are represented by a series of extra bytes containing the value 255. After the string of literals comes the token and any extra bytes needed to indicate string length. This is followed by an offset that indicates how far back in the output buffer to begin copying. The extra bytes (if any) of the match-length come at the end of the sequence

I didn't understand that at all! Does anyone have an easy way to understand example? For example, in the above explanation what is a literal byte and what is a match? How can we have a decoded output buffer when we're just beginning to compress? Length of what?

The explanation at here was also impenetrable for me.

A simple example would be nice unless you have a better way of explaining it.

eg consider the string of unsigned chars: "hello eddy how are you". How would they get compressed? — ade
The following link proposes a very good explanation of LZ4 compression format, and how it works : brutaldeluxe.fr/products/crossdevtools/lz4/index.html — Cyan
ok so in each compressed block there can be one literal section and many match parts ? This would seem logical as a sequence of bytes could be repeated dozens of times. Is this the case ? — ade

Mark Adler Mark Adler · Accepted Answer · 2014-01-15T15:54:03

First, read about LZ77, the core approach being used. The text is a description of a particular way to code a series of literals and string matches in the preceding data.

A match is when the next bytes in the uncompressed data occur in the previously decompressed data. So instead of sending those bytes directly, a length and an offset is sent. Then you go offset bytes backwards and copy length bytes to the output.

Yes, you can't have a match at the beginning of the stream. You have to start with literals. (Unless there is a preset dictionary, which is another topic.)

LZ4 compression algorithm explanation

1 Answers