1
votes

What is the best compression algorithm with the following features:

  • should take less time to decompress (it can reasonably take more time to compress)
  • should be able to compress sorted data (approx. a list of 3,000,000 strings/integers ...)

Please suggest one along with metrics: compression ratio and, if possible, algorithmic complexity for compression and decompression.

4
There are not nearly enough constraints on this question. It depends on the OS, the file system, the data being compressed, and CPU speed vs. I/O speed. E.g., when compressing many small files, it is often quicker to compress than decompress, as the filesystem must create many file entries when decompressing. - D'Arcy Rittich
Hey let's give a little slack! Yes, OS and file system are relevant but you can still measure a compression method against itself for compress/decompress times. Don't be a hater ;) - Dave Swersky
Heh... So the first constraint is, any amount of time is allowed for compression (0-infinity), and decompression must take less than that amount of time. Could be satisfied by an algorithm that takes three months to decompress ten bytes, so long as compressing took three months and 1 second... - Shog9

4 Answers

11
votes

Entire site devoted to compression benchmarking here

1
votes

Well, if you just want speed, then standard ZIP compression is just fine, and it's most likely integrated into your language/framework already (e.g., .NET has it, Java has it). Sometimes the most universal solution is the best: ZIP is a very mature format, and any ZIP library and application will work with any other.

But if you want better compression, I'd suggest 7-Zip: the author is very smart, the tools are easy to get hold of, and he encourages people to use the format.

Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.
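Doing such a benchmark yourself is straightforward. As a minimal sketch (using Python's standard-library `zlib`, which implements the DEFLATE algorithm used by ZIP; the sample payload is made up), you can time compression and decompression at several levels on your own data:

```python
import time
import zlib

# Placeholder payload; substitute your real data here
data = b"example payload " * 50_000  # ~800 KB of repetitive sample data

for level in (1, 6, 9):
    t0 = time.perf_counter()
    packed = zlib.compress(data, level)
    t_compress = time.perf_counter() - t0

    t0 = time.perf_counter()
    unpacked = zlib.decompress(packed)
    t_decompress = time.perf_counter() - t0

    assert unpacked == data  # round-trip sanity check
    print(f"level {level}: ratio {len(data) / len(packed):.1f}x, "
          f"compress {t_compress * 1000:.1f} ms, "
          f"decompress {t_decompress * 1000:.1f} ms")
```

On typical data you should see compression time grow with the level while decompression time stays roughly flat, which matches the point made in the answer below about decompression cost.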

1
votes

You don't have to worry about decompression time. The extra time spent at a higher compression level goes mostly into finding the longest matching pattern.

Decompression either

1) writes the literal byte, or
2) for a (backward position, length) = (m, n) pair,
   goes back m bytes in the output buffer,
   reads n bytes, and
   writes those n bytes at the end of the buffer.
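The steps above can be sketched as follows (a toy illustration of LZ77-style decompression, not any real codec's wire format; the token stream encoding is invented for the example):

```python
def lz_decompress(tokens):
    """Decompress a stream of tokens.

    Each token is either an int (a literal byte) or a tuple (m, n):
    go back m bytes in the output buffer and copy n bytes forward.
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            m, n = tok
            start = len(out) - m
            # Copy byte by byte so overlapping copies (n > m) work,
            # reading bytes that this same copy has just written.
            for i in range(n):
                out.append(out[start + i])
        else:
            out.append(tok)
    return bytes(out)

# "ab" followed by (go back 2, copy 4) expands to "ababab"
print(lz_decompress([ord("a"), ord("b"), (2, 4)]))  # b'ababab'
```

Note that every token costs O(1) work per output byte regardless of how hard the compressor searched for that match, which is why decompression time is independent of compression level.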

So the decompression time is independent of the compression level. And, based on my experience with the Universal Decompressor Virtual Machine (UDVM, RFC 3320), I'd guess the same is true for most decompression algorithms.

0
votes

This is an interesting question. On such sorted data of strings and integers, I would expect difference (delta) coding approaches to outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio. General-purpose encoders do not exploit the special properties of the data.
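To illustrate the idea, here is a minimal sketch (the sorted sample data is made up): on sorted integers, the gaps between consecutive values are small and repetitive, so delta-coding the list before handing it to a general-purpose compressor can improve the ratio noticeably.

```python
import zlib

def delta_encode(sorted_ints):
    """Replace each value with its difference from the previous one."""
    prev = 0
    deltas = []
    for v in sorted_ints:
        deltas.append(v - prev)
        prev = v
    return deltas

# Hypothetical sorted data: ~3,000 evenly spaced integers
values = list(range(1000, 100000, 33))

raw = b",".join(str(v).encode() for v in values)
delta = b",".join(str(d).encode() for d in delta_encode(values))

packed_raw = zlib.compress(raw, 9)
packed_delta = zlib.compress(delta, 9)
print(len(packed_raw), len(packed_delta))  # delta stream compresses smaller
```

Real delta-coding schemes would typically follow the subtraction with a variable-length integer encoding rather than decimal text, but even this crude version shows the effect of exploiting sortedness.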