4 votes

I have an opportunity to preset a dictionary for deflate compression. It makes sense in my case, because the data to be compressed is relatively small (1 KB-3 KB) and I have a large sample of representative examples. The data to be compressed consists of arbitrary sequences of bytes, so tokenization etc. is not a good way to go. Also, the data shows a lot of repetition (between examples), so a good dictionary could potentially give very good results. The question is how to calculate a good dictionary. Is there an algorithm which calculates an optimal dictionary (given sample data)?

I started looking at prefix trees, but it is not clear how to use them in this context.

Best regards, Jarek


2 Answers

3 votes

I am not aware of an algorithm to generate an optimal or even a good dictionary. This is generally done by hand. I think that a suffix tree would be a good approach to finding common strings for a dictionary, but I have never tried it.
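As a very rough substitute for a suffix tree, here is a sketch in C that just counts repeated fixed-length substrings across the samples and reports the ones that occur more than once; the window length K and the sample strings are made-up placeholders to replace with your real data:

    /* Crude substring-frequency sketch (not a suffix tree): collect every
     * K-byte window from the samples, sort the windows, and report any
     * substring that occurs more than once. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 8                         /* substring length to count */

    /* Order two K-byte windows so identical ones end up adjacent. */
    static int cmp(const void *a, const void *b)
    {
        return memcmp(*(const unsigned char * const *)a,
                      *(const unsigned char * const *)b, K);
    }

    int main(void)
    {
        /* Hypothetical stand-ins for your 1-3 KB records. */
        const char *samples[] = {
            "id=42;name=widget;color=red;size=10",
            "id=43;name=widget;color=blue;size=12",
            "id=44;name=gadget;color=red;size=10",
        };
        size_t nsamples = sizeof(samples) / sizeof(samples[0]);

        /* Collect a pointer to every K-byte window in every sample. */
        const unsigned char **grams = NULL;
        size_t ngrams = 0;
        for (size_t i = 0; i < nsamples; i++) {
            size_t len = strlen(samples[i]);
            for (size_t j = 0; j + K <= len; j++) {
                grams = realloc(grams, (ngrams + 1) * sizeof(*grams));
                grams[ngrams++] = (const unsigned char *)samples[i] + j;
            }
        }

        /* Sort, then scan for runs of identical windows. */
        qsort(grams, ngrams, sizeof(*grams), cmp);
        for (size_t i = 0; i < ngrams; ) {
            size_t run = 1;
            while (i + run < ngrams && memcmp(grams[i], grams[i + run], K) == 0)
                run++;
            if (run > 1)
                printf("%zu x \"%.*s\"\n", run, K, (const char *)grams[i]);
            i += run;
        }
        free(grams);
        return 0;
    }

The most frequent substrings it reports are candidates to concatenate into the dictionary.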

The first thing to try is to simply concatenate 32K worth of your 1-3K examples and see how much gain that provides over no dictionary. From there you can experiment, changing the ordering of the examples or pulling repeated pieces out of the examples and moving them to the end of the dictionary.
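For example, a minimal sketch of that measurement, assuming zlib (link with -lz); the sample and dictionary bytes below are made-up placeholders for your real data:

    /* Compress one sample twice, once without a dictionary and once with a
     * preset dictionary built from other samples, and print both sizes. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Deflate data with an optional preset dictionary; return compressed size. */
    static size_t deflate_size(const unsigned char *data, size_t len,
                               const unsigned char *dict, size_t dictlen)
    {
        z_stream strm;
        unsigned char out[65536];        /* plenty of room for 1-3 KB inputs */
        memset(&strm, 0, sizeof(strm));
        deflateInit(&strm, Z_BEST_COMPRESSION);
        if (dict != NULL)
            deflateSetDictionary(&strm, dict, (uInt)dictlen);
        strm.next_in = (Bytef *)data;
        strm.avail_in = (uInt)len;
        strm.next_out = out;
        strm.avail_out = sizeof(out);
        deflate(&strm, Z_FINISH);
        size_t n = strm.total_out;
        deflateEnd(&strm);
        return n;
    }

    int main(void)
    {
        /* Hypothetical stand-ins for one record and for a dictionary
         * concatenated from other, similar records (up to 32 KB). */
        const unsigned char sample[] =
            "id=42;name=widget;color=red;size=10;price=9.99";
        const unsigned char dict[] =
            "id=;name=widget;color=;size=;price=";

        printf("no dictionary:   %zu bytes\n",
               deflate_size(sample, sizeof(sample) - 1, NULL, 0));
        printf("with dictionary: %zu bytes\n",
               deflate_size(sample, sizeof(sample) - 1, dict, sizeof(dict) - 1));
        return 0;
    }

Keep in mind that the decompressor has to be handed exactly the same dictionary, via inflateSetDictionary, or it cannot reconstruct the data.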

Note that the most common strings should be put at the end of the dictionary: deflate encodes matches as backward distances from the current position, so strings near the end of the dictionary are closer to the data being compressed and take fewer bits to reference.

0 votes

I don't know how good this is, but it's a dictionary creator: https://github.com/vkrasnov/dictator