1 vote

I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.

This question is similar to "zlib, deflate: How much memory to allocate?", but not the same. I am asking about each chunk in isolation, rather than the entire stream.

So suppose we have two buffers, INPUT and OUTPUT, and a BUFFER_SIZE of, say, 4096 bytes. (Just a convenient number; no particular reason I chose this size.)

If I deflate using:

deflate(stream, Z_PARTIAL_FLUSH)

so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
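
Concretely, the pattern I have in mind is something like this (a rough sketch using the hypothetical INPUT/OUTPUT/BUFFER_SIZE names from above; error handling omitted):

    #include <zlib.h>

    #define BUFFER_SIZE 4096

    unsigned char INPUT[BUFFER_SIZE];
    unsigned char OUTPUT[BUFFER_SIZE];

    /* Compress one chunk already placed in INPUT, flushing it to OUTPUT.
       The question is whether OUTPUT can ever fill up (avail_out == 0),
       forcing a second call with a reallocated or fresh output buffer. */
    int compress_chunk(z_stream *strm, size_t in_len)
    {
        strm->next_in   = INPUT;
        strm->avail_in  = (uInt)in_len;
        strm->next_out  = OUTPUT;
        strm->avail_out = BUFFER_SIZE;
        return deflate(strm, Z_PARTIAL_FLUSH);
    }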

Superficially, we'd assume that the DEFLATED data will always be larger than the uncompressed input data (assuming we use a compression level that is greater than 0.)

Of course, that's not always the case, especially for small inputs. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like headers and dictionaries in the LZW stream.

Thinking about how LZW works, it would seem that if our input data is at least 256 bytes (meaning that, in the worst-case scenario, every single byte is different and we can't really compress anything), then only inputs SMALLER than 256 bytes plus the zlib headers could potentially require a LARGER output buffer.

But, generally, real-world applications aren't going to be compressing sizes that small. So assuming an input/output buffer of something more like 4K, is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?

(Also, I know about deflateBound, but would rather avoid it because of the overhead.)

Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?

I'm confused. Are you compressing or decompressing? First you say deflate(), which is compression. Then you say that each chunk is "decompressed". So decompression. Then you say that deflated data will be larger if you're not using compression level 0. So you mean decompression? Then you say that deflate data will be larger than the uncompressed stream for small data sizes, so compression. Then you ask about guaranteeing that the output is smaller than the input (which you can't), so compression. – Mark Adler
Sorry for the confusion - I'm compressing. When I said "decompressed" that was a mistake (I edited the post to fix it.) – Siler
Then what does "Superficially, we'd assume that the DEFLATED data will always be larger than the uncompressed input data" mean? – Mark Adler

2 Answers

2 votes

Though I can't quite make heads or tails out of your question, I can comment on parts of the question in isolation.

is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?

Absolutely not. It will always be possible for the compressed output to be larger than some input. Otherwise, by the pigeonhole principle, you wouldn't be able to compress any other input: there are more possible inputs of a given length than there are distinct shorter outputs.

(Also, I know about deflateBound, but would rather avoid it because of the overhead.)

Almost no overhead. We're talking a fraction of a percent larger than the input buffer for reasonable sizes.

By the way, deflateBound() provides a bound on the size of the entire output stream as a function of the size of the entire input stream. It can't help you when you are in the middle of a bunch of deflate() calls with incomplete input and insufficient output space. For example, you may still have deflate output pending and delivered by the next deflate() call, without providing any new input at all. Then the expansion ratio is infinite for that isolated call.
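
For completeness, a rough sketch of how deflateBound() might be used to size a single output buffer for a one-shot compression (function name and error handling are this sketch's own; per the caveat above, this bounds the whole stream, not an individual mid-stream deflate() call):

    #include <stdlib.h>
    #include <zlib.h>

    /* Compress all of `in` in one call, sizing the output buffer with
       deflateBound() so it cannot overflow. Returns a malloc'd buffer,
       or NULL on failure; error handling kept minimal for brevity. */
    unsigned char *compress_all(const unsigned char *in, uLong in_len,
                                uLong *out_len)
    {
        z_stream strm = {0};   /* zalloc/zfree/opaque default to Z_NULL */
        if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
            return NULL;

        uLong bound = deflateBound(&strm, in_len);  /* worst-case output size */
        unsigned char *out = malloc(bound);
        if (out == NULL) {
            deflateEnd(&strm);
            return NULL;
        }

        strm.next_in   = (Bytef *)in;
        strm.avail_in  = (uInt)in_len;
        strm.next_out  = out;
        strm.avail_out = (uInt)bound;

        /* With a deflateBound()-sized buffer, a single Z_FINISH call is
           guaranteed to complete the stream. */
        deflate(&strm, Z_FINISH);
        *out_len = strm.total_out;
        deflateEnd(&strm);
        return out;
    }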

due to the overhead of things like headers and dictionaries in the LZW stream.

deflate is not LZW. The approach it uses is called LZ77. It is very different from LZW, which is now obsolete. There are no "dictionaries" stored in compressed deflate data. The "dictionary" is simply the uncompressed data that precedes the data currently being compressed or decompressed.
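
(That idea is visible in zlib's own API: deflateSetDictionary() simply primes the window with bytes as if they had preceded the input. A rough sketch, with a hypothetical wrapper name:)

    #include <zlib.h>

    /* Prime a freshly initialized deflate stream with prior uncompressed
       data, so later matches can reference it. Must be called right after
       deflateInit() and before any call to deflate(). */
    int prime_with_preceding_data(z_stream *strm,
                                  const unsigned char *preceding, uInt len)
    {
        return deflateSetDictionary(strm, preceding, len);
    }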

Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers ...

The whole idea behind the zlib interface is for you to not have to worry about what will fit in the buffers. You just keep calling deflate() or inflate() with more input data and more output space until you're done, and all will be well. It does not matter if you need more than one call to consume one buffer of input, or more than one call to fill one buffer of output. You just have loops that make more calls, provide more input when needed, and dispose of the output and provide fresh output space when needed.
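
For example, a minimal version of that loop, modeled on zlib's zpipe.c example (reading from one FILE* and writing to another; error handling omitted for brevity):

    #include <stdio.h>
    #include <zlib.h>

    #define CHUNK 4096

    /* Compress src to dst with fixed-size buffers; the loops, not the
       buffer sizes, guarantee that nothing is lost. */
    int compress_stream(FILE *src, FILE *dst)
    {
        unsigned char in[CHUNK], out[CHUNK];
        z_stream strm = {0};
        int flush;

        if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
            return Z_ERRNO;

        do {
            /* Provide more input when needed. */
            strm.avail_in = (uInt)fread(in, 1, CHUNK, src);
            strm.next_in  = in;
            flush = feof(src) ? Z_FINISH : Z_NO_FLUSH;

            do {
                /* Provide fresh output space; deflate() holds on to any
                   output that didn't fit and emits it on the next call. */
                strm.avail_out = CHUNK;
                strm.next_out  = out;
                deflate(&strm, flush);
                fwrite(out, 1, CHUNK - strm.avail_out, dst);
            } while (strm.avail_out == 0);
        } while (flush != Z_FINISH);

        deflateEnd(&strm);
        return Z_OK;
    }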

1 vote

Information theory dictates that there must always be pathological cases which "compress" to something larger.

This page starts off with the worst-case encoding sizes for zlib: it looks like the worst-case growth is 6 bytes, plus 5 bytes per 16 KB block started. So if you always flush after less than 16 KB, buffers that are 11 bytes larger than your flush interval should be safe.
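
As a rough sketch, that arithmetic (using the 6-byte and 5-bytes-per-16 KB figures quoted above, not values taken from the zlib API) works out to:

    #include <stddef.h>

    /* Worst-case deflate output size for one flushed chunk, per the
       figures quoted above: 6 bytes fixed, plus 5 bytes for each
       16 KB block started. */
    size_t worst_case_deflate_size(size_t input_len)
    {
        size_t blocks_started = (input_len + 16383) / 16384;
        return input_len + 6 + 5 * blocks_started;
    }

    /* e.g. for the 4096-byte buffer from the question:
       4096 + 6 + 5*1 = 4107 bytes. */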

Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.