python zlib - size of compressed string vs Shannon entropy

Question

I am trying to better understand how the output of compression algorithms - such as zlib - compares with one's theoretical expectation. So I have a couple of questions.

(1) First I'd like to check that I am correctly computing the compression ratio. Say I want to compress an array of 1000 ones, I can do the following

# encode the array such that len(s) == 1000 bytes
s = np.ones(1000, dtype='uint8').tostring()

# compress using the python zlib (deflate)
comp_s = zlib.compress(s, 9) 
# giving comp_s = 'x\xdacd\x1c\x05\xa3`\x14\x0cw\x00\x00\xa7e\x03\xe9'

comp_ratio = len(comp_s)/len(s)
# giving 17/1000

Hence my first question: is comp_s encoded such that its length corresponds to the number of bytes? I couldn't quite figure out how this string is encoded. If I do sys.getsizeof(comp_s) I find that it's size 54 bytes instead of 17 bytes? Since getsizeof returns the size of the python object so it certainly overestimates the size of the string, am I correct assuming that sys.getsizeof(s) - sys.getsizeof('') is the correct way? It seems to yield the same result as len() at least.

(2) The size of a compressed sequence should be greater than (or equal to) its Shannon entropy. For a random binary sequence of 1's and 0's occurring with 50:50 probability, the number of information per digit is 1-bit (by definition h = - p log p - (1-p)log(1-p)). Since a truly random sequence is incompressible, if I generate a random binary sequence of length n I would expect that by adding one random digit, the resulting n+1 long sequence will be on average 1-bit larger after compression.

When I do the following

rawsize = range(1, 100000, 1000)
compsize = []
for l in rawsize:
    s = np.random.randint(0, 2, l, dtype='uint8').tostring() 
    comp_s = zlib.compress(s, 9)
    # note: I compress again to achieve better compression when l is large
    comp_s = zlib.compress(comp_s, 9)
    compsize.append(len(comp_s))

If I plot compsize / rawsize I find that the curve approaches a constant value around 0.155 meaning (if I interpret correctly) that by adding one digit the amount of information increases by 0.155-bits. I do not understand this, as it seems that the compression is doing much better than the theoretical expectation.

To understand this further I also compared the compressed string size for the case of binary sequences of 1's and 0's where the 1's occur with probability 0<p<1. Then, the compressed size of the string (per digit) should trace the Shannon entropy and be maximal (=1) at p=0.5. I find that the curve for the compressed string size (per digit) lies far below the Shannon entropy and that if I multiply the Shannon entropy by 0.155 they roughly lie on top of each other.

Clearly there is some normalisation factor I am not accounting for, but I can't figure out the rationale for it. I have also tried to encode the original sequence using 16, 32 and 64 bits unsigned integers and found that the ratio compsize / rawsize becomes roughly 0.176, 0.2, 0.23, respectively, so it looks like that by adding one byte in the representation of 1's and 0's we contribute about 0.25 bits of extra information, this is also curious.

Any suggestion would be really useful!

I put this as a note for the 1 bit longer claim, that as you stated correctly that would be for truly random sequences, which is not the common case. — Uriel
zlib introduces 6 bytes of overhead, and deflate adds several more. When you double-compress, you'll get the overhead bytes included twice. — Glenn Randers-Pehrson
the overhead is insignificant (max 0.03% according to documentation). The problem that I am reporting is that I seem to be compressing far more efficiently than I should be able to -- about 6 times. As a sidenote, I have tried to do the same with np.random.bytes(N) and compsize / rawsize approaches 1 in that case, maybe it gives some useful hint to what's going on. — stefano

user1620443 user1620443 · Accepted Answer · 2017-05-24T16:35:09

When calling np.random.randint(0, 2, l, dtype='uint8').tostring(), you are not obtaining a random sequence of 0 and 1, but a random sequence of the 8-bit binary representations of 0s and 1s: 10000000 and 00000000. Pretty much 1 every 8 bits is random, the other 7 are all 0s. I guess the optimal ratio should be about 1/8, plus some overhead.

Indeed, if using np.random.randint(0, 256, 100000, dtype='uint8').tostring() instead, the comp_ratio is ~1.

python zlib - size of compressed string vs Shannon entropy

2 Answers