We currently have a data log. The log is append-only, but on each append the whole log is scanned from the beginning to run consistency checks (certain combinations of events trigger an alarm).

Now we want to turn that log into a compressed log. Individual log entries are typically a few dozen bytes, so they won't compress well on their own. The log stream as a whole, however, compresses well because it contains enough redundancy.
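To make the size argument concrete, here is a minimal sketch using java.util.zip.Deflater. The entry format and the names EntryVsStreamSketch, entry and measure are made up for illustration; real entries will give different numbers, but the gap between per-entry and whole-stream compression should show up in the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class EntryVsStreamSketch {

    // Hypothetical log entry, a few dozen bytes long.
    static byte[] entry(int i) {
        String line = "ts=" + (1700000000 + i) + " sensor=" + (i % 7) + " state=OK\n";
        return line.getBytes(StandardCharsets.US_ASCII);
    }

    // Size of `input` after compressing it as one complete zlib stream.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Returns { raw size, sum of per-entry compressed sizes, whole-stream compressed size }.
    static int[] measure(int entries) {
        StringBuilder whole = new StringBuilder();
        int perEntryTotal = 0;
        for (int i = 0; i < entries; i++) {
            byte[] e = entry(i);
            whole.append(new String(e, StandardCharsets.US_ASCII));
            perEntryTotal += deflatedSize(e);
        }
        byte[] raw = whole.toString().getBytes(StandardCharsets.US_ASCII);
        return new int[] { raw.length, perEntryTotal, deflatedSize(raw) };
    }

    public static void main(String[] args) {
        int[] sizes = measure(1000);
        System.out.println("raw bytes:            " + sizes[0]);
        System.out.println("compressed per entry: " + sizes[1]);
        System.out.println("compressed as stream: " + sizes[2]);
    }
}
```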

In theory, appending to the compressed stream should be easy, since the state of the compression encoder can be reconstructed while the log is scanned (and decompressed).

Our current approach is to run a compressor with identical settings during the scan and decompression phase and feed it the freshly decompressed data, on the assumption that it will build up the identical state.
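For reference, the replay approach described above could look roughly like the sketch below. It is not our production code: the class and method names (ReplayAppendSketch, appendByReplay, sampleLog) are hypothetical, and it assumes the log is a non-empty zlib stream that is sync-flushed after every entry and never finished, so it always ends on a byte boundary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ReplayAppendSketch {

    // Hypothetical log entry.
    static byte[] entry(int i) {
        return ("ts=" + (1700000000 + i) + " sensor=" + (i % 7) + " state=OK\n")
                .getBytes(StandardCharsets.US_ASCII);
    }

    // Compresses `input` and sync-flushes, so the output ends on a byte boundary.
    static byte[] deflateAndFlush(Deflater deflater, byte[] input) {
        deflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH)) > 0) {
            out.write(buf, 0, n);
            if (n < buf.length) {
                break; // flush is complete
            }
        }
        return out.toByteArray();
    }

    // Decompresses everything available; the stream is never finished, so stop at end of input.
    static byte[] inflateAll(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            int n;
            while ((n = inflater.inflate(buf)) > 0) {
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt log", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Scan the log, replay the plain text through a second compressor, then append.
    static byte[] appendByReplay(byte[] compressedLog, byte[] newEntry) {
        byte[] plain = inflateAll(compressedLog);      // the consistency scan would happen here
        Deflater deflater = new Deflater();            // assumed: same settings as the writer
        try {
            deflateAndFlush(deflater, plain);          // output discarded, only the state matters
            byte[] tail = deflateAndFlush(deflater, newEntry);
            byte[] result = Arrays.copyOf(compressedLog, compressedLog.length + tail.length);
            System.arraycopy(tail, 0, result, compressedLog.length, tail.length);
            return result;
        } finally {
            deflater.end();
        }
    }

    // Builds a compressed sample log the way the writer is assumed to produce it.
    static byte[] sampleLog(int entries) {
        Deflater writer = new Deflater();
        ByteArrayOutputStream log = new ByteArrayOutputStream();
        for (int i = 0; i < entries; i++) {
            byte[] chunk = deflateAndFlush(writer, entry(i));
            log.write(chunk, 0, chunk.length);
        }
        writer.end();
        return log.toByteArray();
    }

    public static void main(String[] args) {
        byte[] log = appendByReplay(sampleLog(200), entry(200));
        System.out.println(inflateAll(log).length + " plain bytes after append");
    }
}
```

The cost we want to get rid of is the discarded deflateAndFlush call: it recompresses the whole log just to rebuild the compressor's history.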

However, we know that this is not optimal. We'd like to reuse the state that is built during decompression when compressing the new data. So the question is: how can we implement the (de)compression so that we do not need to feed the decompressed data to a compressor to build the state, but can reuse the state of the decompressor to compress the new data we append?

(We need to do this in Java, unfortunately, which limits the number of available APIs. Including free/open source third-party code is an option, however.)

Sounds like a plan. What's the question in all this? - Atsby
The question is how to implement the (de)compression in a way that we do not need to feed the decompressed data to a compressor to build the state, but just re-use the state of the decompressor to compress the data we append. - MarkusSchaber
Wouldn't you just hack an implementation of, say, gzip, to add a method to allow a compressor instance to copy a decompressor instance's state? - Atsby
@Atsby: That is a possible solution. However, since we are on Java, we cannot access gzip directly. And, to be honest, hacking such a function into the internals of a compressor implementation that was not designed with that requirement in mind is not that easy, and there is a high risk of breaking something. - MarkusSchaber
I meant a Java implementation of gzip ... maybe jzlib would be a good target. I seriously doubt there is a lib out there that has such a feature by default. - Atsby

1 Answer

You probably don't have the interfaces you need in Java, but this can be done with zlib. You could write your own Java interface to zlib to do this.

While scanning, you would retain the last 32K of uncompressed data using a queue. You would scan the compressed file using Z_BLOCK in inflate(). That would stop at every deflate block. When you get to the last block, which is identified by the first bit of the block (the BFINAL bit), you would save the uncompressed data of that block, as well as the 32K that preceded it that you were saving in the queue. You would also save the last bits in the previous block that did not complete a byte (0..7 bits). You would then add your new log entry to that last chunk of uncompressed data, and then recompress just that part, using the 32K that preceded it with deflateSetDictionary(). You would start the compression on a bit boundary with deflatePrime(). That would overwrite what was the last compressed block with a new compressed block or blocks.
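As far as I know, java.util.zip does not expose Z_BLOCK or deflatePrime(), so the exact procedure above needs a custom zlib binding. The dictionary part can still be sketched with the standard classes, though. The following is a simplified variant, not the procedure above: it assumes the log is written as raw deflate and sync-flushed after every append, so it always ends on a byte boundary and never contains a final block. Under that assumption nothing has to be overwritten, and a fresh Deflater primed with the last 32K of uncompressed data via setDictionary() produces bytes that can simply be appended. The names WindowAppendSketch and deflateSegment are hypothetical, and the sketch passes the whole history for brevity where a real scan would keep only a 32K queue.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class WindowAppendSketch {
    static final int WINDOW = 32 * 1024; // deflate's maximum back-reference distance

    // Hypothetical log entry.
    static byte[] entry(int i) {
        return ("ts=" + (1700000000 + i) + " sensor=" + (i % 7) + " state=OK\n")
                .getBytes(StandardCharsets.US_ASCII);
    }

    // Compresses `data` as a raw deflate segment that may refer back into the
    // last 32K of `history`, and sync-flushes so it ends on a byte boundary.
    static byte[] deflateSegment(byte[] history, byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate
        try {
            if (history.length > 0) {
                int from = Math.max(0, history.length - WINDOW);
                deflater.setDictionary(history, from, history.length - from);
            }
            deflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH)) > 0) {
                out.write(buf, 0, n);
                if (n < buf.length) {
                    break; // flush is complete
                }
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompresses the concatenated raw deflate segments as one stream.
    static byte[] inflateAll(byte[] compressed) {
        Inflater inflater = new Inflater(true);
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            int n;
            while ((n = inflater.inflate(buf)) > 0) {
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt log", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] result = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        for (int i = 0; i < 500; i++) {
            plain.write(entry(i), 0, entry(i).length);
        }
        byte[] history = plain.toByteArray();
        byte[] log = deflateSegment(new byte[0], history);

        byte[] tail = deflateSegment(history, entry(500)); // no replay of the old data
        byte[] appended = concat(log, tail);

        System.out.println("tail bytes with history:    " + tail.length);
        System.out.println("tail bytes without history: " + deflateSegment(new byte[0], entry(500)).length);
        System.out.println("round trip ok: "
                + Arrays.equals(inflateAll(appended), concat(history, entry(500))));
    }
}
```

The trade-off is a few bytes of flush marker per append, in exchange for never having to touch the last block or deal with bit alignment.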