0
votes

How would one predict the execution time and/or the resulting compression ratio when compressing a file with a given lossless compression algorithm? I am especially concerned with local compression: if you know the time and compression ratio for local compression, you can easily calculate the time for network compression from the currently available network throughput.

Let's say you have some information about the file, such as size, redundancy, and type (we can say text, to keep it simple). Maybe we also have some statistical data from actual prior measurements. What else would be needed to predict execution time and/or compression ratio, even very roughly?

For local compression alone, the size of the file would have an effect, since reading and writing data to/from the storage medium (SD card, hard drive) would take up the dominant portion of the total execution time.

The actual compression portion will probably depend on redundancy/type, since most compression algorithms work by compressing small blocks of data (100 KB or so). For example, larger HTML/JavaScript files compress better because they have higher redundancy.

I guess there is also the problem of scheduling, but that could probably be ignored for a rough estimate.

This is a question that has been in my head for quite some time. I've been wondering whether some low-overhead code (say, on the server) could predict how long it would take to compress a file before performing the actual compression.
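
Roughly, I imagine something like the sketch below (all constants are made-up placeholders, not measurements; in practice they would have to come from prior benchmarks on the actual hardware):

    # Hypothetical estimator: predict time and ratio from file size and type.
    # Every constant here is a placeholder, not a measured value.

    TYPICAL_RATIO = {       # assumed average compression ratios per file type
        "text": 3.0,
        "html": 4.0,
        "binary": 1.2,
    }

    DISK_READ_MBPS = 100.0   # assumed sequential read throughput
    DISK_WRITE_MBPS = 80.0   # assumed sequential write throughput
    COMPRESS_MBPS = 40.0     # assumed compressor throughput (e.g. gzip -6)

    def estimate(size_bytes, file_type):
        """Return (estimated_seconds, estimated_compression_ratio)."""
        mb = size_bytes / 1e6
        ratio = TYPICAL_RATIO.get(file_type, 1.5)
        read_s = mb / DISK_READ_MBPS
        compress_s = mb / COMPRESS_MBPS
        write_s = (mb / ratio) / DISK_WRITE_MBPS  # output is smaller by ~ratio
        return read_s + compress_s + write_s, ratio

    print(estimate(50_000_000, "html"))

Whether a single throughput number per algorithm is good enough, or whether it has to be interpolated from prior measurements on similar files, is really what I'm asking.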

3
This is pretty broad. If you have "statistical data" (i.e. measurements taken from previous workloads on similar files), then you can presumably get an estimate by interpolation. In the general case, it's not clear there's a general-purpose solution to this. (And how exactly have you measured "redundancy"?) – Oliver Charlesworth

3 Answers

1
votes

Sample the file by taking 10-100 small pieces from random locations. Compress them individually. This should give you a lower bound on compression ratio.

This only returns meaningful results if the chunks are not too small. The compression algorithm must be able to make use of a certain size of history to predict the next bytes.
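
A minimal sketch of this idea in Python, assuming DEFLATE (via zlib) as the target algorithm; the chunk size and chunk count are arbitrary choices:

    import os
    import random
    import zlib

    def estimate_ratio(path, num_chunks=20, chunk_size=128 * 1024):
        """Estimate compression ratio by compressing random chunks of the file."""
        file_size = os.path.getsize(path)
        raw = comp = 0
        with open(path, "rb") as f:
            for _ in range(num_chunks):
                # Pick a random offset; chunks near the end may be shorter.
                f.seek(random.randrange(max(1, file_size - chunk_size)))
                chunk = f.read(chunk_size)
                raw += len(chunk)
                comp += len(zlib.compress(chunk, 6))
        return raw / comp  # uncompressed bytes per compressed byte

    print(estimate_ratio("big_file.log"))

The 128 KB chunk size comfortably exceeds DEFLATE's 32 KB window; with much smaller chunks the estimate degrades, as noted above.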

0
votes

It depends on the data, but with images you can take small samples. Downsampling would change the result. Here is an example: PHP - Compress Image to Meet File Size Limit.
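
A rough sketch of that sampling idea in Python using Pillow (the crop size, sample count, and PNG as the codec are all assumptions), taking small crops rather than downsampling:

    import io
    import random
    from PIL import Image  # assumes Pillow is installed

    def estimate_png_ratio(path, samples=10, box=128):
        """Crop small regions (no downsampling) and compress each as PNG."""
        img = Image.open(path)
        raw = comp = 0
        for _ in range(samples):
            x = random.randrange(max(1, img.width - box))
            y = random.randrange(max(1, img.height - box))
            crop = img.crop((x, y, x + box, y + box))
            buf = io.BytesIO()
            crop.save(buf, format="PNG")
            raw += crop.width * crop.height * len(crop.getbands())
            comp += buf.tell()
        return raw / comp

    print(estimate_png_ratio("photo.png"))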

0
votes

The compression ratio can be calculated with the standard formulas (compression ratio = uncompressed size / compressed size, and space savings = 1 − compressed size / uncompressed size); see:

http://geekresearchlab.net/mtechproject/content/public/upload/5_2_o.jpg

Performance benchmarking can be done with suites such as the V8 benchmark suite or SunSpider (for JavaScript implementations).

You can also use algorithms like DEFLATE or LZMA for the compression itself, and PPM (Prediction by Partial Matching) can be used for prediction.
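
For example, a quick way to measure both ratio and speed for DEFLATE (zlib) and LZMA in Python; the file path and compression level are placeholders:

    import lzma
    import os
    import time
    import zlib

    def benchmark(path):
        data = open(path, "rb").read()
        for name, compress in (("DEFLATE", lambda d: zlib.compress(d, 6)),
                               ("LZMA", lzma.compress)):
            start = time.perf_counter()
            out = compress(data)
            elapsed = time.perf_counter() - start
            ratio = len(data) / len(out)
            print(f"{name}: ratio {ratio:.2f}, "
                  f"{len(data) / 1e6 / elapsed:.1f} MB/s")

    benchmark("sample.txt")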