1 vote

I am serializing a large Java object to a file and later reading it back. Since the object is pretty large and I have around 600 instances of it (each in a separate file), I use compression. I am currently using bzip2 via Apache Commons Compress's org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.lang3.SerializationUtils;

try (InputStream in = new BZip2CompressorInputStream(new FileInputStream("myfile.bz2"))) {
    Document doc = (Document) SerializationUtils.deserialize(in);
}

The problem is that decompression currently takes a long time (over 10 seconds per file), so reading all 600 objects takes around two hours. I would like to either use a faster compression class or tune the current class's parameters so that decompression is faster. I am mostly worried about decompression time, since it happens many times; slow compression is bearable. I am also willing to accept a larger compressed file in exchange for faster decompression.

When compressing using different software you can usually choose "compression level", with values like "Fastest", "Fast", "Normal", "Best". Sometimes you even get more parameters like "Compression Method", "Dictionary Size", "Word Size", etc.

Does anybody know how to control these parameters via code, and what some recommended values are? Or does anybody know of classes with fast decompression?
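(To illustrate the kind of control I mean: for the zlib family, the level can at least be passed to java.util.zip.Deflater in code. The helper names below are made up, and this is just a sketch of level selection, not the bzip2 case:)

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class LevelDemo {

    // Compress with an explicit level: Deflater.BEST_SPEED (1) up to
    // Deflater.BEST_COMPRESSION (9).
    static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompression takes no level; it simply reads whatever was written.
    static byte[] decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }
}
```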

What is the bottleneck when decompressing? Could be something else like unbuffered input. - Thorbjørn Ravn Andersen
Also the speed is due to the size of the data structure needed. The larger it is, the slower it is to check in and the advantages are diminishing for most kinds of data. - Thorbjørn Ravn Andersen
Just for the fun of it, try to do some experiments with all the files being _un_compressed and see what the speed is then. - Thorbjørn Ravn Andersen

1 Answer

3 votes

BZip2 gets very good compression ratios, but at the expense of being quite slow. At the other end of the spectrum is something like Snappy, which is incredibly fast but does not achieve as good compression ratios. GZip is in the middle.
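If gzip's middle ground is acceptable, your snippet needs nothing beyond java.util.zip. A minimal sketch (the GzipStore helper names are mine, and any Serializable class works in place of your Document); note the buffered streams, which also address the unbuffered-input concern raised in the comments:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStore {

    // Serialize and gzip-compress an object to a file.
    static void save(Serializable obj, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(file))))) {
            out.writeObject(obj);
        }
    }

    // Decompress and deserialize an object from a file.
    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(file))))) {
            return in.readObject();
        }
    }
}
```

The same BufferedInputStream wrapping is worth trying around your existing BZip2CompressorInputStream before switching formats, since an unbuffered FileInputStream can be a bottleneck on its own.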

Here is a list of some compression benchmarks in Java.