Here are five ways with gzip, three needing an index, two not.
It is possible to create an index for any gzip file, i.e. not specially constructed, as done by zran.c. Then you can start decompression at block boundaries. The index includes the 32K of uncompressed data history at each entry point.
If you are constructing the gzip file, then it can be made with periodic entry points whose index does not need uncompressed history at those entry points, making for a smaller index. This is done with the Z_FULL_FLUSH option to deflate() in zlib.
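As a rough sketch of that approach, here is a Python version, using the standard zlib module as a stand-in for calling deflate() from C. The function names, the 64 KiB spacing, and the index layout of (uncompressed offset, compressed offset) pairs are assumptions made for illustration; they are not part of zlib.

```python
import zlib

def compress_with_entry_points(data, span=1 << 16):
    """Gzip `data`, doing a Z_FULL_FLUSH every `span` uncompressed bytes.

    Returns the gzip bytes and an index of
    (uncompressed_offset, compressed_offset) pairs, one per entry point.
    """
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 selects the gzip wrapper
    out = bytearray()
    index = []
    for start in range(0, len(data), span):
        if start:
            # Ends the current block on a byte boundary and resets the
            # compressor's history, so nothing after this point refers
            # back to data before it.
            out += comp.flush(zlib.Z_FULL_FLUSH)
            index.append((start, len(out)))
        out += comp.compress(data[start:start + span])
    out += comp.flush()  # Z_FINISH: final block plus the gzip trailer
    return bytes(out), index

def read_from(gz, compressed_offset):
    """Decompress from an entry point to the end of the deflate stream."""
    inflater = zlib.decompressobj(-15)  # raw deflate, since the gzip header is skipped
    return inflater.decompress(gz[compressed_offset:])
```

The index here holds only offsets, with no 32K window per entry. The gzip trailer that follows the last block ends up in the decompressor's unused_data and is ignored in this sketch.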
You could also do a Z_SYNC_FLUSH followed by a Z_FULL_FLUSH at each such point, which would insert two markers. Then you can search for the nine-byte pattern 00 00 ff ff 00 00 00 ff ff to find those. That's no different than searching for the six-byte marker in bzip2 files, except that a false positive is much less likely with nine bytes. Then you don't need a separate index file.
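A minimal sketch of the two-marker variant, again in Python with the standard zlib module standing in for the C API. The function names and the 64 KiB spacing are illustrative assumptions, and a real scanner would still want to guard against the (unlikely) false positive.

```python
import zlib

# Empty stored block from Z_SYNC_FLUSH, then the one from Z_FULL_FLUSH.
MARKER = bytes.fromhex("0000ffff000000ffff")

def compress_with_markers(data, span=1 << 16):
    """Gzip `data`, inserting the double marker every `span` input bytes."""
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 selects the gzip wrapper
    out = bytearray()
    for start in range(0, len(data), span):
        if start:
            out += comp.flush(zlib.Z_SYNC_FLUSH)
            out += comp.flush(zlib.Z_FULL_FLUSH)
        out += comp.compress(data[start:start + span])
    out += comp.flush()
    return bytes(out)

def find_entry_points(gz):
    """Return the compressed offsets just past each nine-byte marker."""
    points = []
    pos = gz.find(MARKER)
    while pos != -1:
        points.append(pos + len(MARKER))
        pos = gz.find(MARKER, pos + 1)
    return points

def read_from(gz, compressed_offset):
    """Decompress from an entry point to the end of the deflate stream."""
    return zlib.decompressobj(-15).decompress(gz[compressed_offset:])
```

Scanning the compressed bytes replaces the index file. The price is nine extra bytes per entry point plus whatever compression is lost by resetting the history.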
Both gzip and xz support simple concatenation. This allows you to easily prepare an archive for parallel decompression in another way. In short:
gzip < a > a.gz
gzip < b > b.gz
cat a.gz b.gz > c.gz
gunzip < c.gz > c
cat a b | cmp - c
will result in the compare succeeding.
You can then simply compress in chunks of the desired size and concatenate the results. Save an index of the offsets of the start of each gzip stream, and decompress from those offsets. You can pick the size of the chunks to your liking, depending on your application. If you make them too small, however, compression will suffer, since each chunk starts with no history and carries its own header and trailer.
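The chunk-and-concatenate scheme could look something like this in Python, using the standard gzip module. The function names and the in-memory list of offsets are assumptions for illustration; in practice the offsets would be written to a side file.

```python
import gzip

def compress_chunks(data, chunk_size):
    """Compress `data` as concatenated gzip streams, one per chunk.

    Returns the concatenated bytes and the offset at which each stream starts.
    """
    out = bytearray()
    offsets = []
    for start in range(0, len(data), chunk_size):
        offsets.append(len(out))
        out += gzip.compress(data[start:start + chunk_size])
    return bytes(out), offsets

def read_chunk(gz, offsets, i):
    """Decompress only the i-th gzip stream, located through the index."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(gz)
    return gzip.decompress(gz[offsets[i]:end])
```

Because every stream is independent, read_chunk can be called for different values of i from separate workers, and the whole file still decompresses with plain gunzip.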
With simple concatenation of gzip files, you could also forgo the index if you made each chunk a fixed uncompressed size. Then each chunk ends with the same four bytes, the uncompressed length in little-endian order, e.g. 00 00 10 00 for 1 MiB chunks, followed by 1f 8b 08 from the next chunk, which is the start of a gzip header. That seven-byte marker can then be searched for just like the bzip2 marker, though again with a smaller probability of false positives.
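Here is a sketch of that index-free search in Python with the standard gzip and struct modules. The function names are hypothetical, and the marker is derived from whatever fixed chunk size was used, so it is only valid for files written with that same size.

```python
import gzip
import struct

def boundary_marker(chunk_size):
    """Length field of the gzip trailer (little-endian, modulo 2**32)
    followed by the first three bytes of the next gzip header."""
    return struct.pack("<I", chunk_size & 0xFFFFFFFF) + b"\x1f\x8b\x08"

def compress_fixed_chunks(data, chunk_size=1 << 20):
    """Concatenate gzip streams of exactly `chunk_size` input bytes each
    (the last one may be shorter)."""
    return b"".join(
        gzip.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    )

def find_stream_starts(gz, chunk_size=1 << 20):
    """Locate the start of every gzip stream by scanning for the marker."""
    marker = boundary_marker(chunk_size)
    starts = [0]
    pos = gz.find(marker)
    while pos != -1:
        starts.append(pos + 4)  # skip the length field; the stream begins at 1f 8b 08
        pos = gz.find(marker, pos + 1)
    return starts
```

For 1 MiB chunks, boundary_marker(1 << 20) is the seven bytes 00 00 10 00 1f 8b 08. A short final chunk does no harm, since nothing follows it that needs to be found.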
The same could be done with concatenated xz files, whose header is the seven bytes: fd 37 7a 58 5a 00 00.