
I am storing many chunks of base64-encoded 64-bit doubles in an XML file. The double data all looks similar.

The double data is currently compressed using Java's 'Deflate' algorithm before the encoding; however, each chunk of binary data in the file will have its own deflate data dictionary, which is an overhead I would like to greatly reduce. The 'Deflater' class has a 'setDictionary' method which I would like to use.
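For context, roughly what I do per chunk at the moment looks like the sketch below (the method name, compression level, and buffer sizes are just illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.Deflater;

// Per-chunk pipeline: doubles -> bytes -> deflate -> base64 -> XML text node.
// Each chunk is currently compressed completely independently of the others.
static String compressChunk(double[] values) {
    ByteBuffer packed = ByteBuffer.allocate(values.length * 8); // 8 bytes per double
    for (double v : values) {
        packed.putDouble(v);
    }

    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(packed.array());
    deflater.finish();

    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        compressed.write(buffer, 0, n);
    }
    deflater.end();

    return Base64.getEncoder().encodeToString(compressed.toByteArray());
}
```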

So my questions are:

1) Does anyone have suggestions for how best to build my own single custom data dictionary, based on multiple sections of doubles (8 bytes each), that could be used for multiple deflate operations, i.e. use the same dictionary for all the compressions? Should I be looking for bytes that are common across all the byte arrays, with the most common byte placed at the end of the dictionary array?

2) Can I keep the (custom) data dictionary separate from the deflated data, and then set the dictionary again later, just before inflating the data?

3) Will the deflate algorithm take my custom data dictionary and then just build its own slightly different dictionary anyway, both invalidating my single dictionary and reducing the potential space saving from sharing one?

4) Can someone elaborate on the structure of zlib-compressed data, so that I can try to separate the data dictionary from the compressed data myself?

I want to spend space on the data dictionary only once in my file, and use it for each block of my double data, but not store it alongside the double data. If the data dictionary cannot be separated from the deflated data and stored separately, then it seems there would be little value in building a single custom dictionary, as each compressed block would carry its own dictionary anyway. Is this right?

Simplest IMO would be to compress the entire XML, rather than compressing across XML elements. Is that feasible for you? – Taylor
@Taylor The XML elements (other than my binary) are of trivial size compared to the binary, so I'm not worried about compressing the whole file (i.e. the text) - it's the binary I need compressed. It is feasible, but compressing the whole file would mean compressing the base64 representation of my binary data, and I want to do the binary compression before base64 encoding. – Simon Perkins
The thing is, compressing the file is trivial to do. What you're after is a ton of work, a lot more code, and unconventional to boot. All of that will make this difficult to maintain. If you're doing all this work to avoid compressing your document a bit more, or because you want compression to occur before encoding (why?), the effort/benefit ratio doesn't seem worthwhile. Just my $0.02. – Taylor
@Taylor The point is to minimise the space that is taken up in each compressed block by a data dictionary. If all blocks (and there may be thousands) shared a dictionary and did not need one of their own, I'd see that as a potential space saver. Encoding (base64) simply happens so that the data can persist in XML and be transmitted over networks; it really has nothing to do with what I'm trying to do. – Simon Perkins
I still don't get why you don't just compress the entire XML doc. Not trying to be a jerk, it just strikes me as so much simpler. – Taylor

1 Answer

  1. You can either set a fixed dictionary that consists of strings that are common and frequent in your data, or you can use the last n chunks concatenated as a dictionary. Either way, both the compression and decompression ends need the same dictionary to work with on any given chunk.

  2. The dictionary is not sent with the data. That's the whole point. The other side needs to know the dictionary that was used in order to decompress, using some approach like those in #1 (see the sketch after this list).

  3. The dictionary deflate uses has no structure. At any point in time, you are using the previous 32K of uncompressed data as the dictionary within which to search for matching strings starting at the next byte after that 32K. Setting the dictionary allows the compressor to get a head start for the first 32K of data. That's all there is to it.

  4. The "dictionary" is in the compressed data only in the sense that it is what you get when you decompress; there is no separate dictionary structure in the stream that you could pull out.
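For what it's worth, here is a minimal sketch of what #1 and #2 look like with java.util.zip. The class name, the placeholder dictionary contents, and the buffer sizes are all just illustrative; the only requirement is that the compressing and decompressing ends hold the same dictionary bytes, which never appear in the XML file itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SharedDictionaryExample {

    // A shared preset dictionary known to both ends. How you build it is up to
    // you (e.g. byte patterns that recur across representative chunks, with the
    // most useful material at the end); this content is purely a placeholder.
    static final byte[] SHARED_DICTIONARY =
            "placeholder dictionary bytes".getBytes(StandardCharsets.US_ASCII);

    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setDictionary(SHARED_DICTIONARY); // set before any deflate() call
        deflater.setInput(raw);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && inflater.needsDictionary()) {
                // The zlib header carries only the Adler-32 checksum of the
                // dictionary, never the dictionary bytes themselves, so the
                // decompressor must be handed the same array here.
                inflater.setDictionary(SHARED_DICTIONARY);
            } else {
                out.write(buffer, 0, n);
            }
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] original = "a chunk of packed doubles would go here".getBytes(StandardCharsets.US_ASCII);
        byte[] roundTripped = decompress(compress(original));
        System.out.println(new String(roundTripped, StandardCharsets.US_ASCII));
    }
}
```

Note that this uses the default zlib wrapper; with raw deflate (the nowrap constructors) there is no DICTID field, so needsDictionary() is never signalled and you would simply set the dictionary on the Inflater up front.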