3
votes

I am writing a ZLIB-like API for an embedded hardware compressor which uses the deflate algorithm to compress a given input stream.

Before going further I would like to explain the data compression ratio. The data compression ratio is defined as the ratio between the uncompressed size and the compressed size:

Compression ratio = uncompressed size / compressed size

The compression ratio is usually greater than one, which means the compressed data is usually smaller than the uncompressed data; that is the whole point of compressing. But this is not always the case. For example, using the ZLIB library on pseudo-random data generated on a Linux machine gives a compression ratio of roughly 0.996, which means 9960 bytes are compressed into 10000 bytes.
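For anyone who wants to reproduce this kind of measurement, here is a minimal sketch (not the poster's API) using zlib's one-shot compress2(); note that its output includes the zlib header and trailer that the question deliberately excludes:

    /* Sketch: measure the deflate (zlib-wrapped) compression ratio of
     * pseudo-random data.  Link with -lz. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        const uLong src_len = 10000;
        uLongf dst_len = compressBound(src_len);   /* worst-case output size */
        Bytef *src = malloc(src_len);
        Bytef *dst = malloc(dst_len);
        if (src == NULL || dst == NULL)
            return 1;

        srand(42);
        for (uLong i = 0; i < src_len; i++)
            src[i] = (Bytef)(rand() & 0xFF);       /* essentially incompressible */

        if (compress2(dst, &dst_len, src, src_len, Z_BEST_COMPRESSION) != Z_OK)
            return 1;

        printf("ratio = %.3f (%lu -> %lu bytes)\n",
               (double)src_len / (double)dst_len,
               (unsigned long)src_len, (unsigned long)dst_len);

        free(src);
        free(dst);
        return 0;
    }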

I know ZLIB handles this situation by using a type 0 (stored) block, where it simply returns the original uncompressed data with a roughly 5-byte header, so it adds only about 5 bytes of overhead per data block of up to 64 KB. This is an intelligent solution to the problem, but for some reason I cannot use it in my API. I have to provide the extra safe space in advance to handle this situation.

Now, if I knew the lowest possible compression ratio, it would be easy for me to calculate the extra space I have to provide. Otherwise, to be safe, I have to provide more extra space than needed, which can be critical in an embedded system.

While calculating the data compression ratio, I am not concerned with the header, footer, extremely small datasets, or system-specific details, as I am handling those separately. What I am particularly interested in is whether there exists any real dataset with a minimum size of 1 KB which gives a compression ratio of less than 0.99 using the deflate algorithm. In that case the calculation would be:

Compression ratio = uncompressed size / (compressed size using deflate, excluding header, footer and system-specific overhead)
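For what it's worth, here is one way that extra space could be sized once a worst-case ratio is assumed; the 0.99 ratio and the constant margin below are placeholders for whatever limit turns out to be safe, not known properties of the hardware:

    #include <stddef.h>

    /* Sketch only: size an output buffer from an assumed worst-case
     * compression ratio (uncompressed/compressed), here 0.99, plus a small
     * constant margin for block framing.  Both values are placeholders. */
    #define WORST_RATIO_NUM  99            /* ratio = 99/100 = 0.99 */
    #define WORST_RATIO_DEN 100

    static size_t output_buffer_size(size_t uncompressed_len)
    {
        /* output <= input / ratio, rounded up */
        size_t expanded = (uncompressed_len * WORST_RATIO_DEN
                           + WORST_RATIO_NUM - 1) / WORST_RATIO_NUM;
        return expanded + 16;              /* constant safety margin (placeholder) */
    }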

Please provide feedback. Any help would be appreciated. It would be great if a reference to such a dataset could be provided.

EDIT:
@MSalters' comment indicates that the hardware compressor is not following the deflate specification properly, and that this may be a bug in the microcode.

Are you aware that the compression ratio heavily depends on the data at hand? For example, a 1-byte text file will be at least 45 bytes (if I'm correct) when zipped... Also, you'd have to keep the file-system specifics in mind, specifically the block size, to be able to even estimate this. The lowest compression ratio is original data / (static overhead + original data) in the case of incompressible data... (Just a story: a friend of mine once thought about a compression scheme with a brilliant compression ratio: it would compress everything into 1 byte, and then always extract Microsoft Excel...) – ppeterka
@ppeterka66: Thanks, I am aware of what you said. If we use deflate, a zero-byte empty file is compressed into 2 bytes, and the 6-byte header+footer makes it 8 bytes in total, which gives a data compression ratio of zero :) I am not concerned with the header, footer, extremely small datasets, or system-specific details, as I am handling those separately. What I am particularly concerned with is real data with a minimum size of 1 KB which gives a compression ratio of less than 0.99. – raj_gt1
@raj_gt1: Please update your question to discuss your actual situation. You have tried to cast your real problem into an abstract one dealing with the deflate format. That question does not have a useful answer, since the lower limit on the compression ratio as you have defined it is zero, assuming that the only constraint is the deflate format. – Mark Adler
Compressing a random 1024-byte chunk of data, for example from /dev/urandom or another good PRNG like Xorshift, typically produces a deflated block of about 1072 bytes, yielding a deflate compression ratio of about 0.96 or less. This shows that ~0.96 is common. You can find worse compression ratios in already-compressed data. These are all common; the pathological cases that produce much worse compression are rarer. – Nominal Animal
My survey on compression goo.gl/2okQWR (see Section 2.2) shows that most practical techniques on real datasets achieve a compression ratio of ~2x, and some up to 4x, although the potential for 16x exists. Techniques with a higher ratio have higher overhead too. – user984260

3 Answers

2
votes

The deflate format takes a similar approach to the ZLIB algorithm. It uses a 3-bit block header, and the lower two bits are 00 when the following block is stored: length-prefixed but otherwise uncompressed.

This means the worst case is a one-byte input that blows up to 6 bytes (3 bits of header, 5 bits of padding to a byte boundary, 32 bits of length, 8 bits of data), so the worst ratio is 1/6 ≈ 0.17.

This is of course assuming an optimal encoder. A suboptimal encoder would transmit a Huffman table for that one byte.
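Following that reasoning, a small sketch of the bound this gives for arbitrary input sizes, assuming the encoder falls back to stored blocks (at most 65535 payload bytes and roughly 5 bytes of framing per block) whenever that is smaller:

    #include <stddef.h>
    #include <stdio.h>

    /* Worst-case deflate output size if the encoder can always fall back to
     * stored blocks: each block holds at most 65535 bytes and costs about
     * 5 bytes (3-bit header padded to a byte boundary, plus LEN and NLEN). */
    static size_t deflate_stored_bound(size_t len)
    {
        size_t blocks = len ? (len + 65534) / 65535 : 1;  /* at least one block */
        return len + 5 * blocks;
    }

    int main(void)
    {
        printf("%zu\n", deflate_stored_bound(1));       /* 6      -> ratio 1/6 */
        printf("%zu\n", deflate_stored_bound(10000));   /* 10005               */
        printf("%zu\n", deflate_stored_bound(100000));  /* 100010              */
        return 0;
    }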

3
votes

Because of the pigeonhole principle

http://en.wikipedia.org/wiki/Pigeonhole_principle

you will always have strings that get compressed and strings that get expanded.

http://matt.might.net/articles/why-infinite-or-guaranteed-file-compression-is-impossible/

Theoretically, you can achieve the best compression with zero-entropy data (an infinite compression ratio) and the worst compression with maximum-entropy data (e.g. AWGN-like noise), where the compression ratio tends towards zero.

3
votes

I can't tell from your question whether you're using zlib or not. If you're using zlib, it provides a function, deflateBound(), which does exactly what you're asking for, taking an uncompressed size and returning the maximum compressed size. It takes into account how the deflate stream was initialized with deflateInit() or deflateInit2() in computing the proper header and trailer sizes.
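For example, a minimal sketch of sizing the buffer with deflateBound(); the raw-deflate parameters here (windowBits = -15, memLevel = 8) are an assumption and should match however the real stream is initialized:

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        z_stream strm = {0};    /* zalloc/zfree/opaque default to Z_NULL */

        /* Raw deflate, no zlib/gzip wrapper; adjust to the real setup. */
        if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;

        uLong src_len = 10000;
        uLong max_out = deflateBound(&strm, src_len);   /* worst-case output */
        printf("deflateBound(%lu) = %lu\n",
               (unsigned long)src_len, (unsigned long)max_out);

        deflateEnd(&strm);
        return 0;
    }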

If you're writing your own deflate, then you will already know what the maximum compressed size is based on how often you allow it to use stored blocks.

Update: The only way to know for sure the maximum data expansion of a hardware deflator is to obtain the algorithm used. Then through inspection you can determine how often it will emit stored blocks for random data.

The only alternative is empirical and unreliable. You can feed the hardware compressor random data and examine the results. You can use infgen to disassemble the deflate output and see the stored blocks and their sizes. Then you can write a linear bounding formula for the expansion. Then add some margin to the additive and multiplicative terms to cover situations that you did not observe in your tests.
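As an illustration only, such a bound might end up looking like the sketch below; the coefficients are placeholders to be fitted from the measured expansions, with margin added on top:

    #include <stddef.h>

    /* Empirical linear bound on the hardware compressor's output size:
     * out <= b * in + a.  The values below are placeholders; measure the
     * device on random and already-compressed data, fit a and b, then add
     * margin to both terms. */
    #define BOUND_A      64        /* additive term, bytes (with margin)       */
    #define BOUND_B_NUM  10010     /* multiplicative term ~1.001 (with margin) */
    #define BOUND_B_DEN  10000

    static size_t hw_deflate_bound(size_t in_len)
    {
        return (in_len * BOUND_B_NUM + BOUND_B_DEN - 1) / BOUND_B_DEN + BOUND_A;
    }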

This will only work if the hardware deflate algorithm is well behaved, which means that it will not write a fixed or dynamic deflate block if a stored block would be smaller. If it is not well behaved, then all bets are off.