I have the opportunity to use a preset dictionary for deflate compression. It makes sense in my case, because the data to be compressed is relatively small (1 KB to 3 KB per item) and I have a large sample of representative examples. The data consists of arbitrary byte sequences, so tokenization etc. is not a good way to go. The examples also share a lot of repeated content with each other, so a good dictionary could potentially give very good results. The question is: how do I compute a good dictionary? Is there an algorithm that calculates an optimal dictionary from the sample data?
I started looking at prefix trees, but it is not clear how to use them in this context.
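To make the setup concrete, here is a minimal sketch of how a preset dictionary plugs into deflate, assuming Python's standard `zlib` module and its `zdict` parameter. The helper names (`compress_with_dict`, `decompress_with_dict`, `total_compressed_size`) are hypothetical and only for illustration. It does not compute a dictionary; it only shows how a candidate dictionary could be scored against the sample set, so that different candidates can be compared.

```python
import zlib


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress one item with a preset dictionary (zlib container format)."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress an item; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


def total_compressed_size(samples, zdict=None) -> int:
    """Score a candidate dictionary: total compressed bytes over all samples.

    With zdict=None this returns the baseline without any dictionary.
    """
    if zdict is None:
        return sum(len(zlib.compress(s, 9)) for s in samples)
    return sum(len(compress_with_dict(s, zdict)) for s in samples)
```

A candidate dictionary is better when `total_compressed_size(samples, candidate)` is lower, ideally measured on examples that were not used to build it. Two constraints apply to any candidate: deflate's window is 32 KB, so bytes further back than that cannot be referenced, and the Python `zlib` documentation recommends placing the most frequently occurring subsequences at the end of the dictionary.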
Best regards, Jarek