I am trying to better understand how the output of compression algorithms such as zlib compares with theoretical expectations. So I have a couple of questions.
(1) First I'd like to check that I am correctly computing the compression ratio. Say I want to compress an array of 1000 ones; I can do the following:
import numpy as np
import zlib

# encode the array so that len(s) == 1000 bytes (one byte per element)
s = np.ones(1000, dtype='uint8').tobytes()
# compress using Python's zlib (DEFLATE) at the highest compression level
comp_s = zlib.compress(s, 9)
# giving comp_s = b'x\xdacd\x1c\x05\xa3`\x14\x0cw\x00\x00\xa7e\x03\xe9'
comp_ratio = len(comp_s) / len(s)
# giving 17/1000 = 0.017
Hence my first question: is comp_s encoded such that len(comp_s) gives its size in bytes? I couldn't quite figure out how this string is encoded. If I call sys.getsizeof(comp_s) I get a size of 54 bytes instead of 17. Since getsizeof returns the size of the whole Python object, it certainly overestimates the size of the string; am I correct in assuming that sys.getsizeof(s) - sys.getsizeof(b'') is the right way to measure it? At least it seems to yield the same result as len().
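For what it's worth, this is the check I did (a small sketch, assuming CPython, where the per-object overhead of a bytes object is a fixed constant):
import sys
import zlib
import numpy as np

s = np.ones(1000, dtype='uint8').tobytes()
comp_s = zlib.compress(s, 9)

print(len(comp_s))                                 # 17: the payload size in bytes
print(sys.getsizeof(comp_s))                       # larger: includes the bytes-object header
print(sys.getsizeof(comp_s) - sys.getsizeof(b''))  # back to 17: the fixed header overhead cancels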
(2) The size of a compressed sequence should be greater than (or equal to) its Shannon entropy. For a random binary sequence of 1s and 0s occurring with 50:50 probability, the information per digit is 1 bit (by definition, h = -p log2(p) - (1-p) log2(1-p), which is maximal at p = 0.5). Since a truly random sequence is incompressible, if I generate a random binary sequence of length n, I would expect that adding one random digit gives a sequence of length n+1 that is, on average, 1 bit larger after compression.
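As a sanity check on that formula, here is a minimal sketch (assuming numpy; the function name is my own) of the entropy per digit:
import numpy as np

def shannon_entropy(p):
    # entropy in bits per digit of a source emitting 1 with probability p
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(shannon_entropy(0.5))  # 1.0 bit per digit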
When I do the following:
import numpy as np
import zlib

rawsize = range(1, 100000, 1000)
compsize = []
for l in rawsize:
    s = np.random.randint(0, 2, l, dtype='uint8').tobytes()
    comp_s = zlib.compress(s, 9)
    # note: I compress a second time, which gives better compression when l is large
    comp_s = zlib.compress(comp_s, 9)
    compsize.append(len(comp_s))
If I plot compsize / rawsize, I find that the curve approaches a constant value of around 0.155, meaning (if I interpret it correctly) that adding one digit increases the amount of information by 0.155 bits. I do not understand this: it seems the compression is doing much better than the theoretical expectation.
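For completeness, this is how I compute and plot the ratio (a condensed sketch, assuming matplotlib; it repeats the loop above in one expression):
import numpy as np
import zlib
import matplotlib.pyplot as plt

rawsize = np.arange(1, 100000, 1000)
# compressed size in bytes for each raw length, compressing twice as above
compsize = np.array([
    len(zlib.compress(zlib.compress(
        np.random.randint(0, 2, n, dtype='uint8').tobytes(), 9), 9))
    for n in rawsize
])

plt.plot(rawsize, compsize / rawsize)
plt.xlabel('raw size (bytes)')
plt.ylabel('compressed size / raw size')
plt.show()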
To understand this further, I also compared the compressed string size for binary sequences of 1s and 0s where the 1s occur with probability 0 < p < 1. Then the compressed size of the string (per digit) should trace the Shannon entropy and be maximal (= 1 bit) at p = 0.5. I find that the curve of the compressed string size (per digit) lies far below the Shannon entropy, and that if I multiply the Shannon entropy by 0.155 the two roughly lie on top of each other.
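The comparison is produced roughly like this (a sketch, assuming numpy and matplotlib; shannon_entropy is the helper from above, repeated here so the snippet runs on its own):
import numpy as np
import zlib
import matplotlib.pyplot as plt

def shannon_entropy(p):
    # entropy in bits per digit of a source emitting 1 with probability p
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

n = 100000
ps = np.linspace(0.01, 0.99, 50)
comp_per_digit = []
for p in ps:
    digits = (np.random.random(n) < p).astype('uint8')
    comp_per_digit.append(len(zlib.compress(digits.tobytes(), 9)) / n)

plt.plot(ps, [shannon_entropy(p) for p in ps], label='Shannon entropy (bits per digit)')
plt.plot(ps, comp_per_digit, label='compressed size per digit')
plt.xlabel('p')
plt.legend()
plt.show()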
Clearly there is some normalisation factor I am not accounting for, but I can't figure out the rationale for it. I have also tried to encode the original sequence using 16-, 32- and 64-bit unsigned integers and found that the ratio compsize / rawsize becomes roughly 0.176, 0.2 and 0.23, respectively, so it looks like adding one byte to the representation of the 1s and 0s contributes about 0.25 bits of extra information, which is also curious.
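The dtype experiment is essentially this (a sketch):
import numpy as np
import zlib

n = 100000
for dtype in ('uint8', 'uint16', 'uint32', 'uint64'):
    s = np.random.randint(0, 2, n, dtype=dtype).tobytes()
    comp_s = zlib.compress(zlib.compress(s, 9), 9)  # compressed twice, as above
    print(dtype, len(comp_s) / len(s))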
Any suggestion would be really useful!
Comments:
"... the 1-bit-longer claim: as you stated correctly, that would be for truly random sequences, which is not the common case." – Uriel
"... np.random.bytes(N) and compsize / rawsize approaches 1 in that case, maybe it gives some useful hint to what's going on." – stefano