How to get uncompressed size of a > 4GB .gz file in python

Question

So there is already this super interesting thread about getting the original size of a .gz file. It turns out the size one can read from the last 4 bytes of the file is 'just' there to make sure extraction was successful. However, it's fine to rely on it IF the extracted data size is below 2**32 bytes, i.e. 4 GB.

Now IF there are more than 4 GB of uncompressed data, there must be multiple members in the .gz! The last 4 bytes would then only indicate the uncompressed size of the last chunk!

So how do we get the ending bytes of the other chunks? Reading the gzip spec, I don't see a length field for the

+=======================+
|...compressed blocks...|
+=======================+

Ok. It must depend on the CM - the compression method, which is probably deflate. Let's see the RFC about it. There on page 11 it says there is a LEN attribute for "Non-compressed blocks", but it gets funky when it describes the compressed ones ...

I can imagine something like

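import os

full_size = os.path.getsize(gz_path)  # compressed size on disk
gz = open(gz_path, 'rb')  # raw compressed bytes, since pos is a file offset
pos = 0
size = 0
while True:
    try:
        # get_header_length, get_block_length and get_orig_size are
        # hypothetical helpers that do not exist yet
        head_len = get_header_length(gz, pos)
        block_len = get_block_length(gz, pos + head_len)
        size += get_orig_size(gz, pos + head_len + block_len)
        pos += head_len + block_len + 8  # skip the 8-byte trailer (CRC32 + ISIZE)
    except Exception:
        break
gz.close()
print('uncompressed size of "%s" is: %i bytes' % (gz_path, size))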

But how to get_block_length?!? :|

This was probably never intended because ... "stream data". But I don't wanna give up now. One big bummer already: Even 7zip shows such a big .gz with the exact uncompressed size of just the very last 4 bytes.

Does someone have another idea?

Mark Adler · Accepted Answer · 2019-01-25T07:30:20

First off, no, there do not need to be multiple members. There is no limit on the length of a gzip member. If the uncompressed data is more than 4 GB, then the last four bytes simply represent that length modulo 2³². A gzip file with more than 4 GB of uncompressed data is in fact very likely to be a single member.
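
To make the modulo point concrete, here is a minimal sketch of reading that trailer field. The function name read_isize and the gz_path parameter are made up for illustration; the only assumption about the format is that the last four bytes are a little-endian unsigned 32-bit integer, as the gzip spec describes. For a single member holding 5 * 2**30 bytes, this value would be 5 * 2**30 % 2**32 = 2**30, not the real size.

```python
import os
import struct


def read_isize(gz_path):
    """Return the last four bytes of the file as an unsigned 32-bit integer.

    This is the uncompressed length of the LAST member modulo 2**32,
    so it is only a hint, not a reliable total size.
    """
    with open(gz_path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # trailer field sits at the very end
        (isize,) = struct.unpack("<I", f.read(4))  # little-endian uint32
    return isize
```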

Second, the fact that you can have multiple members is true even for small gzip files. The uncompressed data does not need to be more than 4 GB for the last four bytes of the file to be useless.
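
A small in-memory illustration of this, using only the standard gzip and struct modules. The payloads and variable names are arbitrary; the sketch just concatenates two tiny members and compares the trailer with the real decompressed length.

```python
import gzip
import struct

first = b"a" * 1000
second = b"b" * 10

# Concatenating two complete gzip streams gives a valid multi-member gzip file.
blob = gzip.compress(first) + gzip.compress(second)

# The trailer only describes the last member.
(isize,) = struct.unpack("<I", blob[-4:])

# Decompressing everything gives the real total.
total = len(gzip.decompress(blob))

print(isize, total)  # 10 1010
```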

The only way to reliably determine the amount of uncompressed data in a gzip file is to decompress it. You don't have to write the data out, but you have to process the entire gzip file and count the number of uncompressed bytes.
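
A minimal sketch of that approach, assuming Python's standard gzip module is acceptable (it reads through concatenated members transparently). The name uncompressed_size and the 1 MiB chunk size are arbitrary choices, not anything prescribed by the format.

```python
import gzip


def uncompressed_size(gz_path, chunk_size=1024 * 1024):
    """Decompress the whole file in chunks and count the bytes.

    Nothing is written out and only one chunk is held in memory at a time.
    """
    total = 0
    with gzip.open(gz_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of the last member
                break
            total += len(chunk)
    return total
```

This still costs a full decompression pass, so for a file with more than 4 GB of data it will take a while, but the count is exact.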
