So there is this super interesting thread already about getting the original size of a .gz file. Turns out the size one can get from the last 4 bytes of the file is 'just' there to make sure extraction was successful. However, it's fine to rely on it if the extracted data size is below 2**32 bytes, i.e. 4 GB.
Now, if there is more than 4 GB of uncompressed data, there may be multiple members in the .gz, in which case the last 4 bytes only indicate the uncompressed size of the last member!
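For reference, reading that trailing field can be sketched as follows. This is a minimal sketch: `read_isize` is a name made up here, and it assumes the file ends with a regular gzip trailer (per RFC 1952 the field, ISIZE, is a little-endian 32-bit value holding the input size modulo 2**32).

```python
import struct

def read_isize(gz_path):
    """Return the ISIZE field stored in the last 4 bytes of a gzip file.

    ISIZE is the uncompressed size of the *last* member, modulo 2**32.
    """
    with open(gz_path, "rb") as f:
        f.seek(-4, 2)  # 4 bytes before the end of the file (whence=2: from end)
        (isize,) = struct.unpack("<I", f.read(4))  # little-endian unsigned 32-bit
    return isize
```

For a file made of several concatenated members this only reports the last one, which is exactly the problem described above.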
So how do we get the trailing bytes of the other members? Reading the gzip spec, I don't see a length field for the
+=======================+
|...compressed blocks...|
+=======================+
OK, so it must depend on the CM (compression method), which is probably deflate. Let's see the RFC about it. There on page 11 it says there is a LEN attribute for "Non-compressed blocks", but it gets funky when they get to the compressed ones ...
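I can imagine something like:
import os

full_size = os.path.getsize(gz_path)
gz = open(gz_path, 'rb')  # raw bytes; gzip.open() would hide the member boundaries
pos = 0   # offset of the current member inside the compressed file
size = 0  # running total of the uncompressed sizes
while pos < full_size:
    try:
        head_len = get_header_length(gz, pos)
        block_len = get_block_length(gz, pos + head_len)  # the missing piece
        size += get_orig_size(gz, pos + head_len + block_len)
        pos += head_len + block_len + 8  # skip the 8-byte trailer (CRC32 + ISIZE)
    except Exception:
        break
print('uncompressed size of "%s" is: %i bytes' % (gz_path, size))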
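The header part at least looks doable from the gzip spec (RFC 1952): 10 fixed bytes, then optional fields announced by the FLG byte. Below is a rough sketch of `get_header_length` under that reading. Note that it takes the raw file contents as a `bytes` object rather than a file handle, which is an assumption of this sketch and differs from the pseudocode above.

```python
import struct

# FLG bits from RFC 1952
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def get_header_length(buf, pos=0):
    """Length in bytes of the gzip member header starting at buf[pos]."""
    if buf[pos:pos + 2] != b"\x1f\x8b":
        raise ValueError("no gzip magic number at offset %d" % pos)
    flags = buf[pos + 3]
    end = pos + 10  # ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", buf, end)  # 2-byte little-endian XLEN
        end += 2 + xlen
    if flags & FNAME:
        end = buf.index(b"\x00", end) + 1  # zero-terminated original file name
    if flags & FCOMMENT:
        end = buf.index(b"\x00", end) + 1  # zero-terminated comment
    if flags & FHCRC:
        end += 2  # CRC16 of the header
    return end - pos
```

A header without any optional fields comes out as 10 bytes; one with a stored file name is 10 bytes plus the name plus its terminating zero byte.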
But how to implement get_block_length?!? :|
This was probably never intended because ... "stream data". But I don't wanna give up now. One big bummer already: even 7-Zip shows such a big .gz with an uncompressed size taken from just the very last 4 bytes.
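The brute-force fallback I can think of: compressed deflate blocks carry no length field, so actually inflate every member and count the output bytes. A minimal sketch, assuming Python's `zlib` module and a file that consists only of well-formed gzip members (no trailing padding or garbage); `member_sizes` and `chunk_size` are names made up for this sketch.

```python
import zlib

def member_sizes(gz_path, chunk_size=1 << 16):
    """Yield the uncompressed size of each member of a (multi-member) gzip file.

    The data is inflated and thrown away, so this costs a full decompression.
    Input is fed in small chunks to keep the output of each call bounded.
    """
    with open(gz_path, "rb") as f:
        buf = f.read(chunk_size)
        while buf:
            d = zlib.decompressobj(wbits=31)  # 31: expect gzip header and trailer
            size = 0
            while not d.eof:
                if not buf:
                    raise EOFError("file ends in the middle of a gzip member")
                size += len(d.decompress(buf))
                if d.eof:
                    buf = d.unused_data  # bytes that belong to the next member
                else:
                    buf = f.read(chunk_size)
            if not buf:
                buf = f.read(chunk_size)  # member ended exactly at a chunk boundary
            yield size
```

`sum(member_sizes(path))` would then give the total without the 2**32 wrap-around, at the price of reading and inflating the whole file.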
Does someone have another idea?