The Python interface to zlib is rather meager, and does not provide access to all of zlib's capabilities. If you can construct your own interface to zlib, then you can do what you're asking, and more.
The "and more" has to do with the fact that you are compressing very short strings individually, which inherently limits how much compression you can get. Since these strings have some common content, you should use the deflateSetDictionary()
and inflateSetDictionary()
functions of zlib to take advantage of that fact, and potentially improve the compression significantly. The common content can be the common prefix you mention, as well as common content anywhere else in the string. You would define a fixed dictionary to use for all strings of up to 32K that contains sequences of bytes that appear commonly in the strings. You would put the most common sequences at the end of the 32K, and less common sequences earlier. If there are several classes of these strings with different common sequences, you can if you like create a set of dictionaries and use the dictionary id returned from the first call of inflate()
to select the dictionary. For one or several dictionaries, you just need to make sure that the same dictionaries are stored on both the compression and decompression ends.
As for storing the compression state, you can do that with deflateCopy(). This is provided in Python as the copy() method. I'm not sure that it will give you much of a speed advantage for small strings, though.
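Here is a sketch of what that could look like, with a hypothetical common prefix: prime one compressor and one decompressor with the prefix once, then copy() the saved state for each string so neither side has to recompress the shared content:

```python
import zlib

PREFIX = b"hypothetical common prefix: "   # made-up shared content

# Prime one compressor/decompressor pair with the prefix, once.
c0 = zlib.compressobj()
primed = c0.compress(PREFIX) + c0.flush(zlib.Z_SYNC_FLUSH)
d0 = zlib.decompressobj()
d0.decompress(primed)            # receiver consumes the priming output once

def compress_tail(tail):
    c = c0.copy()                # deflateCopy(): duplicate the saved state
    return c.compress(tail) + c.flush()

def decompress_tail(blob):
    d = d0.copy()                # the decompression object has copy() too
    return d.decompress(blob)

assert decompress_tail(compress_tail(b"unique tail one")) == b"unique tail one"
assert decompress_tail(compress_tail(b"unique tail two")) == b"unique tail two"
```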
Update:
From recently added comments, I believe that your use case is that you send some of many strings on request to a receiver. There may be a way to get much better compression using the meager Python interface in this case. You can use the flush() method with Z_SYNC_FLUSH to force what has been compressed so far to the output. What this would allow you to do is treat the series of strings requested as a single compressed stream.
The process would be that you start a compression object with compressobj(), use compress() on that object with the first string requested, collect the output of that (if any), and then do a flush(Z_SYNC_FLUSH) on the object, collecting the remaining output. Send the combined output of compress() and flush() to the receiver, which has started a decompressobj() and then uses decompress() on that object with what it was sent, which will return the original string. (No flush is needed on the decompression end.)
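A minimal sketch of that first exchange (the string itself is made up):

```python
import zlib

c = zlib.compressobj()       # sender keeps this object across requests
d = zlib.decompressobj()     # receiver keeps this object across requests

first = b"a made-up first string, with some repetition, some repetition"
blob = c.compress(first) + c.flush(zlib.Z_SYNC_FLUSH)   # force out all output
assert d.decompress(blob) == first    # no flush needed on this end
```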
So far, the result is not much different than just compressing that first string. The good part is that you repeat that process without creating new compress or decompress objects. Just use compress() and flush() for the next string, and decompress() on the other end to get it. The advantage for the second string, and all subsequent strings, is that they get to use the history of the previous strings for compression. Then you do not need to construct or use any fixed dictionaries. You can just use the history of previously requested strings to provide the fodder needed for good compression. If your strings average 1000 bytes in length, eventually each string sent will benefit from the history of the most recently sent 32 strings, since the sliding window for compression is 32K long.
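Continuing the sketch above with the same c and d objects (which is the whole point: they persist across requests), each subsequent made-up string is compressed against the accumulated history:

```python
for i in range(5):
    s = ("made-up record %d: status=OK user=alice action=fetch" % i).encode()
    blob = c.compress(s) + c.flush(zlib.Z_SYNC_FLUSH)   # one request's payload
    assert d.decompress(blob) == s                      # still no flush needed
    print(len(s), "->", len(blob))   # payloads shrink as the 32K window fills
```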
When you're done, just close the objects.