13 votes

I want to concat two or more gzip streams without recompressing them.

I mean I have A compressed to A.gz and B to B.gz, I want to compress them to single gzip (A+B).gz without compressing once again, using C or C++.

Several notes:

  • Even though you can simply concatenate two files and gunzip will know how to deal with the result, most programs are unable to handle a stream made of two such chunks.
  • I once saw example code that does this using only decompression of the files and then manipulating the originals; it is significantly faster than normal re-compression, but still requires O(n) CPU work.
  • Unfortunately I cannot find that example again (concatenation using decompression only); if someone can point me to it I would be grateful.

Note: this is not a duplicate of this question, because the solution proposed there does not fit my needs.

Clarification edit:

I want to concatenate several compressed HTML pieces and send them to the browser as one page, in reply to a request carrying "Accept-Encoding: gzip", with the response carrying "Content-Encoding: gzip".

If the streams are concatenated as simply as cat a.gz b.gz >ab.gz, the Gecko (Firefox) and KHTML web engines get only the first part (a); IE6 does not display anything at all, and Google Chrome displays the first part (a) correctly and the second part (b) as garbage (it does not decompress it at all).

Only Opera handles this well.

So I need to create a single gzip stream from several chunks and send it without re-compressing.

Update: I found gzjoin.c among zlib's examples; it performs the join using decompression only. The problem is that decompression is still slower than a simple memcpy.

It is still 4 times faster than the fastest gzip compression, but that is not enough.

What I need is to figure out which data to save alongside each gzip file so that the decompression pass can be skipped entirely, and how to collect that data during the original compression.

Do you really want to compress them or just concatenate them into the same file? – Tobias Wärre
I want to create one gzip-compressed file/stream/memory-chunk out of two other gzip-compressed files/streams/memory-chunks without decompressing them, concatenating them, and compressing them once again. – Artyom
See the clarification in the edit. – Artyom
gzjoin.c needs to decompress the second stream to keep in sync with the stream. Since a zlib stream does not contain an index, this is needed. In theory you could add the index when it is gzipped in advance, and modify gzjoin to use this index. But it's not for the faint of heart... – Rutger Nijlunsing
If you write this up as an answer I will be able to accept it. – Artyom

4 Answers

14 votes

Look at RFC 1951 (DEFLATE) and RFC 1952 (gzip).

The gzip format is simply a series of members, each composed of three parts: a header, data, and a trailer. The data part is itself a sequence of DEFLATE blocks, each with its own header and data.

To simulate the effect of gzipping the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-block flag, for instance) and the trailer correctly, and copy the data parts.
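
Concretely, the fixed framing of a member is easy to produce by hand. Here is a minimal sketch of the RFC 1952 layout (a header with no optional fields, plus the trailer); the bit-level part of the surgery, clearing the last-block (BFINAL) flag on the final DEFLATE block of the first member's data, is not shown:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal gzip member framing per RFC 1952, no optional header fields.
       The DEFLATE payload between header and trailer is where the real
       surgery happens: the last block of A's data must have its BFINAL
       bit cleared before B's blocks are appended. */
    static void write_gzip_header(FILE *out)
    {
        /* magic, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255 (unknown) */
        static const uint8_t hdr[10] =
            { 0x1f, 0x8b, 0x08, 0, 0, 0, 0, 0, 0, 0xff };
        fwrite(hdr, 1, sizeof hdr, out);
    }

    static void write_gzip_trailer(FILE *out, uint32_t crc, uint32_t isize)
    {
        /* CRC-32 and ISIZE (uncompressed length mod 2^32), little-endian */
        uint8_t t[8];
        for (int i = 0; i < 4; i++) t[i]     = (crc   >> (8 * i)) & 0xff;
        for (int i = 0; i < 4; i++) t[4 + i] = (isize >> (8 * i)) & 0xff;
        fwrite(t, 1, sizeof t, out);
    }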

There is a problem: the trailer contains a CRC-32 of the uncompressed data, and I'm not sure whether it is easy to compute when you only know the CRCs of the parts.

Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC-32 without decompressing the data, there are other things that still require decompression.
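
For the CRC question specifically: zlib exports crc32_combine(), which computes the CRC-32 of a concatenation from the CRCs of the parts alone, without touching the data. A minimal sketch; the placeholder values stand in for the CRC-32 and ISIZE fields that would be read from the real trailers of A.gz and B.gz:

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        /* Placeholders: in practice these come from the last 8 bytes
           (CRC-32, then ISIZE, both little-endian) of each member. */
        uLong   crc_a = 0x12345678;  /* CRC-32 stored in A.gz's trailer */
        uLong   crc_b = 0x9abcdef0;  /* CRC-32 stored in B.gz's trailer */
        z_off_t len_b = 4096;        /* ISIZE stored in B.gz's trailer  */

        /* CRC-32 of the uncompressed concatenation A+B, no decompression */
        uLong crc_ab = crc32_combine(crc_a, crc_b, len_b);
        printf("combined CRC-32: %08lx\n", crc_ab);
        return 0;
    }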

6 votes

The gzip manual says that two gzip files can be concatenated as you attempted.

http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

So it appears that the other tools may be broken, as seen in this bug report: http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263

Apart from filing a bug report with each of the browser makers and hoping they comply, perhaps your program can cache the most common concatenations of the required data.

As others have mentioned, you may be able to perform surgery: http://www.gzip.org/zlib/rfc-gzip.html

This requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can easily be calculated by adding the lengths of the individual sub-files (the ISIZE field is that total modulo 2^32).

At the bottom of the last link there is code for calculating a running CRC-32, named update_crc.

Calculating the CRC over the uncompressed files each time your process runs is probably cheaper than the gzip algorithm itself.
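
If zlib is already linked in, its crc32() function does the same job as update_crc. A minimal sketch, assuming the uncompressed pieces are available as in-memory buffers (the names are illustrative):

    #include <stddef.h>
    #include <stdint.h>
    #include <zlib.h>

    /* Running CRC-32 and total length over several uncompressed buffers;
       both values go into the 8-byte trailer of the final gzip stream. */
    void crc_and_size(const unsigned char **pieces, const size_t *lens,
                      size_t n, uint32_t *crc_out, uint32_t *isize_out)
    {
        uLong    crc   = crc32(0L, Z_NULL, 0);  /* initial CRC value */
        uint32_t isize = 0;

        for (size_t i = 0; i < n; i++) {
            crc    = crc32(crc, pieces[i], (uInt)lens[i]);
            isize += (uint32_t)lens[i];          /* ISIZE is mod 2^32 */
        }
        *crc_out   = (uint32_t)crc;
        *isize_out = isize;
    }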

2 votes

It seems that the original compression of the individual files is done by you. It also seems that the desired result (the concatenation of several pieces) is small enough to be sent to a web browser as one page. In that case your efficiency concerns seem to be unwarranted.

Please note that (1) the gzjoin.c approach is very likely the best answer you can get to your question as stated, and (2) it is complicated microsurgery performed by one of the gzip originators and may not have been subjected to extensive stress testing.

Please consider a boring, understandable, reliable approach: store the original pieces UNcompressed, then select the required pieces, concatenate them, and compress the result. Note that the compression ratio may be better than the one obtained by gluing small compressed pieces together.
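
A minimal sketch of that approach with zlib, assuming the pieces are in-memory buffers and the output buffer is large enough (error handling trimmed; windowBits of 15 + 16 selects the gzip wrapper rather than the raw zlib one):

    #include <string.h>
    #include <zlib.h>

    /* Feed every piece through one deflate stream so the browser
       receives a single well-formed gzip member. */
    int gzip_pieces(const unsigned char **pieces, const size_t *lens,
                    size_t n, unsigned char *out, size_t outsize,
                    size_t *outlen)
    {
        z_stream s;
        memset(&s, 0, sizeof s);
        if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;

        s.next_out  = out;
        s.avail_out = (uInt)outsize;
        for (size_t i = 0; i < n; i++) {
            s.next_in  = (Bytef *)pieces[i];
            s.avail_in = (uInt)lens[i];
            /* Z_FINISH on the last piece emits the gzip trailer */
            if (deflate(&s, i + 1 == n ? Z_FINISH : Z_NO_FLUSH)
                    == Z_STREAM_ERROR) {
                deflateEnd(&s);
                return -1;
            }
        }
        *outlen = outsize - s.avail_out;
        return deflateEnd(&s) == Z_OK ? 0 : -1;
    }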

1 vote

If tarring them is not out of the question (since the linked cat solution isn't viable for you):

    tar cf A_B.gz.tar A.gz B.gz

Then, to get them back:

    tar xf A_B.gz.tar