20
votes

Anyone have a recommendation on a good compression algorithm that works well with double precision floating point values? We have found that the binary representation of floating point values results in very poor compression rates with common compression programs (e.g. Zip, RAR, 7-Zip, etc.).

The data we need to compress is a one-dimensional array of 8-byte values sorted in monotonically increasing order. The values represent temperatures in Kelvin with a span typically under 100 degrees. The number of values ranges from a few hundred to at most 64K.

Clarifications

  • All values in the array are distinct, though repetition does exist at the byte level due to the way floating point values are represented.

  • A lossless algorithm is desired since this is scientific data. Conversion to a fixed point representation with sufficient precision (~5 decimals) might be acceptable provided there is a significant improvement in storage efficiency.

Update

Found an interesting article on this subject. Not sure how applicable the approach is to my requirements.

http://users.ices.utexas.edu/~burtscher/papers/dcc06.pdf

6
A 'lossy' algorithm IS acceptable because your data is not discrete. There is a real maximum physical rate of change and a real accuracy of the sensor, so any lossy encoding with sufficient sampling bandwidth is OK. – Martin Beckett
Martin, thanks for your answer. Technically you are correct, but not all design decisions are based purely on technical considerations. In this case we need to preserve the exact values since they represent the "acceptable" results from another vendor's sampling decisions. – David Taylor

6 Answers

5
votes

First thing to consider: try compressing the data before you convert it to double precision. Re your comment to David Thornley, unless your IR imaging ADCs have 24 significant bits, 32-bit floats should have more than enough precision; it is only your requirement to exactly preserve the noise generated by subsequent processing that is a problem. Failing that, it might conceivably be practical to reverse-engineer your processing by determining a table of the values it generates, and storing an index to this table instead.

Second: if your compression algorithm knows that your data is in 8-byte chunks, it will be much more effective, because it will not throw your most significant bytes in with the least significant bytes. As a crude preprocessing method, you could try prefixing each double with a distinctive byte (an ASCII comma, perhaps?) before piping it through a byte-based compressor like gzip; this should result in better total compression even though the intermediate stream is 12.5% larger. Less crude but more effort would be to write your own compression adapted to this task, perhaps using an 8-level tree to represent the expected values of each byte in your double.
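For what it's worth, here is a minimal sketch of the marker-byte idea in Python. The function names, the comma marker, and gzip are just illustrative choices, not part of the suggestion itself; run it on your real data to see whether the marked stream actually compresses better.

    import gzip
    import struct

    def pack_with_markers(values, marker=b","):
        # Prefix every big-endian 8-byte double with a marker byte so a
        # byte-oriented compressor can "see" the record boundaries.
        return b"".join(marker + struct.pack(">d", v) for v in values)

    def compress_plain(values):
        return gzip.compress(struct.pack(">%dd" % len(values), *values))

    def compress_marked(values):
        return gzip.compress(pack_with_markers(values))

    if __name__ == "__main__":
        temps = [290.0 + 0.001 * i for i in range(10000)]  # toy monotonic data
        print("plain :", len(compress_plain(temps)))
        print("marked:", len(compress_marked(temps)))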

Third: as image data is highly redundant, some form of delta coding or other image-related compression should save some space. However, it will not gain you a terribly large amount if you demand lossless compression, as the image noise is inherently incompressible. Also, it will not help you deal with the pseudo-random hash in the less-significant bits of your doubles, as explained above.

4
votes

All the encoders you list are byte oriented, and are thrown off by a couple of properties of doubles. For one, there is the layout: the 12 bits of sign and exponent don't play well with byte boundaries. For another, there is the noisiness of your input. The first part is easy to deal with in a multitude of ways; the second will limit the effectiveness of any lossless compression you throw at it. I think that even the best result will be less than amazing; I don't know your data, but I'd suspect you can count on a mere 25% saving, more or less.

Off the top of my head, and perhaps useless because you have already thought of everything on this list...

  1. Treat the stream as 64-bit integers and delta-encode adjacent values. If you have runs of values with the same exponent, the deltas will effectively zero out the exponent bits, and possibly some high mantissa bits as well. There will be overflows, but the data still needs only 64 bits and the operation can be reversed (see the sketch after this list).

  2. At this stage you can optionally try some crude integer prediction, and save differences.

  3. If you have followed the previous suggestion, almost half of the values will start with 000... and almost half with FFF... To eliminate that, fold the sign into the low bit: rotate the value left (ROL) by 1 bit, and if the resulting LSB is 1, XOR the upper 63 bits with all Fs. The reverse is: if the LSB is 1, XOR the upper 63 bits with all Fs, then ROR by 1 bit. (This is equivalent to the zigzag transform used by varint encoders.)

On second thought, simply XORing the predictions with the true values may be better than taking differences, because step 3 is then unnecessary.

  4. You can try reordering the bytes to group bytes of the same significance together: first all the most significant bytes, then the next most significant, and so on. At the very least you should get something like a massive run of zeroes first, with at most a few bits of noise.

  5. Run the result through a generic compressor, or first apply RLE to the run of zeroes and then an entropy encoder such as Huffman or, better, the range encoder from 7-Zip/LZMA.
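Here is a rough numpy sketch of steps 1, 3, 4 and 5 chained together. It uses the standard zigzag form of the sign folding and zlib as a stand-in for the generic back end; the function names and the use of numpy are mine, so treat it as an illustration rather than a reference implementation.

    import zlib
    import numpy as np

    ALL_ONES = np.uint64(0xFFFFFFFFFFFFFFFF)

    def encode(values):
        # Step 1: reinterpret as 64-bit integers and take wraparound deltas.
        bits = np.asarray(values, dtype=np.float64).view(np.uint64)
        delta = np.diff(bits, prepend=np.uint64(0))
        # Step 3: zigzag sign folding so small negative deltas don't start with FFF...
        neg = np.where((delta >> np.uint64(63)) == 1, ALL_ONES, np.uint64(0))
        zig = (delta << np.uint64(1)) ^ neg
        # Step 4: group bytes of equal significance into planes.
        planes = zig.view(np.uint8).reshape(-1, 8).T
        # Step 5: hand the result to a generic compressor.
        return zlib.compress(planes.tobytes(), 9)

    def decode(blob, n):
        planes = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(8, n)
        zig = np.ascontiguousarray(planes.T).view(np.uint64).ravel()
        neg = np.where((zig & np.uint64(1)) == 1, ALL_ONES, np.uint64(0))
        delta = (zig >> np.uint64(1)) ^ neg
        return np.cumsum(delta, dtype=np.uint64).view(np.float64)  # undo step 1

Step 2 (integer prediction) would slot in between the delta and the sign-folding stages; only measuring each stage on real data will tell whether it earns its keep.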

There is one good thing about your data: it is monotonic. There is also a bad thing: it is simply too small a set. How much do you want to save, mere kilobytes? What for? The compression effectiveness will also suffer a lot if adjacent values often differ in exponent.

If you are processing a large number of these data sets, you should consider exploiting their similarity to compress them together better, perhaps by interleaving them at some stage. If you can live with some loss, zeroing out a few of the least significant bytes might be a good idea, perhaps on both the source data and the prediction so that you don't reintroduce noise there.

2
votes

If you want high-compression archival storage, "High Throughput Compression of Double-Precision Floating-Point Data" by Burtscher & Patanaworabhan or "Fast and Efficient Compression of Floating-Point Data" by Lindstrom & Isenberg may be helpful to you.

If you want faster dynamic access at the expense of a lower compression rate, then a 1D lifting wavelet may be appropriate. You can quantize the data to smaller integers by specifying the number of digits to keep, then use delta encoding with a predictive model, follow with a Haar (or more expensive) wavelet transform, and finally arithmetic-encode the coefficients greater than a specified value.
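As a small sketch of the quantize-then-lift part of that pipeline, here is fixed-point quantization plus a single level of the reversible integer Haar lifting step. The names are illustrative and a real codec would recurse on the low-pass band and add the arithmetic coder.

    import numpy as np

    def quantize(temps_kelvin, decimals=5):
        # Lossy step: keep roughly `decimals` digits as fixed-point integers.
        return np.round(np.asarray(temps_kelvin) * 10**decimals).astype(np.int64)

    def haar_lift(x):
        # One level of the integer (reversible) Haar lifting scheme; len(x) must be even.
        even, odd = x[0::2], x[1::2]
        detail = odd - even                  # high-pass: small for smooth data
        approx = even + detail // 2          # low-pass: roughly the pairwise averages
        return approx, detail

    def haar_unlift(approx, detail):
        even = approx - detail // 2
        odd = detail + even
        out = np.empty(even.size + odd.size, dtype=even.dtype)
        out[0::2], out[1::2] = even, odd
        return out

    def dequantize(q, decimals=5):
        return q.astype(np.float64) / 10**decimals

On smooth, monotonic data most of the detail coefficients should be tiny, which is exactly what the final entropy coder wants.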

Hope it helps.

You can get Lindstrom's ZFP algorithm here: https://github.com/LLNL/zfp

1
votes

Compression algorithms live on repetitions and regularities, and floating-point numbers don't do well at that.

The first question is whether you can use single-precision floating-point values, which would give you a 50% saving immediately. Few thermometers are accurate to seven significant digits, and the exponent range covers temperatures considerably below anything I've been told you can actually get.

If not, can you filter your temperatures, rounding them off to the equivalent of N digits (more likely N/0.301 bits)? That may introduce enough regularity to be useful.
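If you want to test both suggestions on your own data, here is a quick sketch; the toy data, zlib, and the helper names are placeholders, and the truncation helper is one hypothetical way to implement the "N/0.301 bits" idea.

    import zlib
    import numpy as np

    def compressed_size(arr):
        return len(zlib.compress(arr.tobytes(), 9))

    def keep_mantissa_bits(arr, bits):
        # Truncate the mantissa to its top `bits` bits (bits * 0.301 ~ significant digits kept).
        raw = np.asarray(arr, dtype=np.float64).view(np.uint64)
        mask = np.uint64(0xFFFFFFFFFFFFFFFF) << np.uint64(52 - bits)
        return (raw & mask).view(np.float64)

    temps = 280.0 + 90.0 * np.sort(np.random.default_rng(0).random(10000))  # toy Kelvin data

    variants = [
        ("double ", temps.astype(np.float64)),
        ("single ", temps.astype(np.float32)),       # immediate 50% before any compression
        ("17 bits", keep_mantissa_bits(temps, 17)),  # ~5 significant digits (17 * 0.301)
    ]
    for name, arr in variants:
        print(name, arr.nbytes, "->", compressed_size(arr))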

If you really have to store 64 bits of information for each temperature reading, and all the bits are significant and not predictable from the other bits, then you effectively can't compress it.

1
votes

Can you do a delta between adjacent values?
Is there a limit to how much a value can change between measurements? Is it acceptable to limit this change to some maximum rate (at the cost of introducing some smoothing)?

There is obviously a limit to the real accuracy of the values from the thermal sensor: do you need to store 64 bits of precision, or are you better off storing an integer number of, say, 0.01-Kelvin units?

If you can live with some more error and the increase is relatively smooth, you might be better off just fitting a function to the data and storing only a few terms of that function.
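As an illustration of the curve-fitting idea, here is a hypothetical sketch using a plain polynomial fit; whether the residual error is acceptable depends entirely on your tolerance.

    import numpy as np

    def fit_curve(temps, degree=5):
        # Fit a low-order polynomial to the sorted readings and keep only the
        # coefficients plus the worst-case error, instead of every sample.
        x = np.linspace(0.0, 1.0, len(temps))
        coeffs = np.polyfit(x, temps, degree)
        max_err = float(np.max(np.abs(np.polyval(coeffs, x) - temps)))
        return coeffs, max_err

    temps = 280.0 + 60.0 * np.sort(np.random.default_rng(1).random(5000))  # toy data
    coeffs, max_err = fit_curve(temps)
    print(len(coeffs), "coefficients, worst-case error", max_err, "K")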

EDIT:
Take a look at a typical data set and check the range of differences between adjacent values, then consider what accuracy you need to represent them.

e.g. if you have at most a 1-degree difference between readings, you could store changes in steps of 1/256 of a degree in a single byte. If you need a bigger range or more accuracy, use a short divided by some factor.
So the next reading would be: next_reading = last_reading + (float)increment / 256.0
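Here is a sketch of that scheme (the names and the clamping choice are mine). Note that it accumulates against the reconstructed value rather than the true one, so the quantization error stays bounded instead of drifting.

    import numpy as np

    def encode_deltas(temps, scale=256.0):
        # Store the first reading exactly, then each change as a byte count of
        # 1/scale-degree steps. Lossy: error per reading is bounded by 1/(2*scale),
        # provided no delta ever hits the clamp.
        temps = np.asarray(temps, dtype=np.float64)
        increments = np.empty(len(temps) - 1, dtype=np.uint8)
        last = temps[0]
        for i, t in enumerate(temps[1:]):
            step = min(max(int(round((t - last) * scale)), 0), 255)  # clamp to one byte
            increments[i] = step
            last += step / scale       # track the reconstructed value, not the true one
        return temps[0], increments

    def decode_deltas(first, increments, scale=256.0):
        readings = [first]
        for inc in increments:
            readings.append(readings[-1] + inc / scale)
        return np.array(readings)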

0
votes

You could think about re-coding your data with an entropy coder (Huffman, Shannon-Fano, arithmetic coding). But this will only provide good results if the data points repeat often and if you know which symbols will appear with which probability.
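Since the raw values here are all distinct, an entropy coder would normally be applied to the byte stream, ideally after one of the transforms from the other answers. A quick, hypothetical way to estimate the potential gain is to measure the byte-level entropy:

    import math
    import struct
    from collections import Counter

    def byte_entropy(data: bytes) -> float:
        # Shannon entropy in bits per byte; values near 8.0 mean a plain
        # byte-level entropy coder cannot help much.
        counts = Counter(data)
        total = len(data)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    if __name__ == "__main__":
        temps = [290.0 + 0.001 * i for i in range(10000)]   # toy monotonic data
        raw = struct.pack("<%dd" % len(temps), *temps)
        print("raw doubles:", round(byte_entropy(raw), 2), "bits/byte")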