Wrap deflated data in gzip format

Question

I think I'm missing something very simple. I have a byte array holding deflated data written into it using a Deflater:

deflate(outData, 0, BLOCK_SIZE, SYNC_FLUSH)

The reason I didn't just use GZIPOutputStream is that there are 4 threads (the number is variable), each given a block of data, and each thread compresses its own block before storing the compressed data into a global byte array. If I used GZIPOutputStream, it would mess up the format, because each little block would have a header and trailer and be its own gzip stream (I only want to compress it).

So in the end, I've got this byte array, outData, that's holding all of my compressed data, but I'm not really sure how to wrap it. GZIPOutputStream writes from a buffer of uncompressed data, but this array is all set. It's already compressed, and I'm hitting a wall trying to figure out how to get it into gzip form.

EDIT: Ok, bad wording on my part. I'm writing it to output, not a file, so that it could be redirected if needed. A really simple example is that

cat file.txt | java Jzip | gzip -d | cmp file.txt

should return 0. The problem right now is if I write this byte array as is to output, it's just "raw" compressed data. I think gzip needs all this extra information.

If there's an alternative method, that would be fine too. The whole reason it's like this is because I needed to use multiple threads. Otherwise I would just call GZIPOutputStream.

DOUBLE EDIT: The comments gave a lot of good insight, so here is another approach. I have a bunch of uncompressed blocks of data that were originally one long stream. If gzip can read concatenated streams, could I take those blocks (keeping them in order), give each one to a thread that calls GZIPOutputStream on its own block, and then concatenate the results? In essence, each block would then have a header, the compressed data, and a trailer. Would gzip recognize that if I concatenated them?

Example:

cat file.txt
Hello world! How are you? I'm ready to set fire to this assignment.

java Testcase < file.txt > file.txt.gz

So I accept it from input. Inside the program, the stream is split up into "Hello world!" "How are you?" "I'm ready to set fire to this assignment" (they're not strings, it's just an array of bytes! this is just illustration)

So I've got these three blocks of bytes, all uncompressed. I give each of these blocks to a thread, which uses

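// Needs java.io.IOException, java.io.OutputStream and java.util.zip.GZIPOutputStream.
// Subclass that exposes the protected deflater (def) and CRC (crc) fields.
public static class DGZIPOutputStream extends GZIPOutputStream
{
    // flush == true makes flush() use SYNC_FLUSH on the underlying Deflater.
    public DGZIPOutputStream(OutputStream out, boolean flush) throws IOException
    {
        super(out, flush);
    }
    // Prime the deflater with a preset dictionary.
    public void setDictionary(byte[] b)
    {
        def.setDictionary(b);
    }
    // Feed bytes into the running CRC-32 without compressing them.
    public void updateCRC(byte[] input)
    {
        crc.update(input);
    }
}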

As you can see, the only thing here is that I've set the flush to SYNC_FLUSH so I can get the alignment right and have the ability to set the dictionary. If each thread were to use DGZIPOutputStream (which I've tested and it works for one long continuous input), and I concatenated those three blocks (now compressed each with a header and trailer), would gzip -d file.txt.gz work?

If that's too weird, ignore the dictionary completely. It doesn't really matter. I just added it in while I was at it.

what makes you think you can write gzip data using multiple threads? i'm pretty sure the gzip process generates some sort of shared data which affects the zipped data. i don't think you can (easily) multi-plex this work across multiple threads. — jtahlborn
Well, it's an assignment. I've noticed that Java doesn't really have much in the way of multithreaded compression. This is also a very simplified version. We literally only need to have the threads compress the data in parallel. — user1777900

Mark Adler · Accepted Answer · 2012-10-28T06:11:06

If you set nowrap true when using the Deflater (sic) constructor, then the result is raw deflate. Otherwise it's zlib, and you would have to strip the zlib header and trailer. For the rest of the answer, I am assuming nowrap is true.
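As a rough illustration of the nowrap difference, here is a minimal sketch that compresses the same bytes both ways with java.util.zip.Deflater. The class name RawVsZlib and the helper compress are made up for this example. The zlib form is the raw deflate data plus a two-byte header and a four-byte Adler-32 trailer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class RawVsZlib {
    // Compress input in one shot. nowrap == true asks for raw deflate,
    // nowrap == false gives the zlib-wrapped form.
    static byte[] compress(byte[] input, boolean nowrap) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        def.setInput(input);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!def.finished()) {
            int n = def.deflate(buf);
            out.write(buf, 0, n);
        }
        def.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "Hello world! How are you?".getBytes(StandardCharsets.US_ASCII);
        byte[] zlib = compress(data, false);
        byte[] raw = compress(data, true);
        // Size of the zlib wrapper (header plus trailer) around the deflate data.
        System.out.println("wrapper bytes: " + (zlib.length - raw.length));
        System.out.printf("first zlib byte: 0x%02x%n", zlib[0] & 0xff);
    }
}
```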

To wrap a complete, terminated deflate stream to be a gzip stream, you need to prepend ten bytes:

"\x1f\x8b\x08\0\0\0\0\0\0\xff"

(sorry -- C format, you'll need to convert to Java octal). You also need to append the four-byte CRC-32 of the uncompressed data in little-endian order, followed by the four-byte total uncompressed length modulo 2^32, also in little-endian order. Given what is available in the standard Java API, you'll need to compute the CRC in serial. It can't be done in parallel. zlib does have a function to combine separate CRCs that are computed in parallel, but that is not exposed in Java.
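In Java the header and trailer described above might look like the sketch below. The class GzipWrap and its helpers (wrap, writeIntLE, rawDeflate, gunzip) are names invented for this example, and everything is held in memory for simplicity. The header is written as a byte array rather than an escaped string, and the result is checked by reading it back with the stock GZIPInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

public class GzipWrap {
    // The ten header bytes: magic 1f 8b, method 8 (deflate), no flags,
    // zero mtime, no extra flags, OS 255 (unknown).
    static final byte[] HEADER = {
        0x1f, (byte) 0x8b, 0x08, 0, 0, 0, 0, 0, 0, (byte) 0xff
    };

    // Write the low 32 bits of v, least significant byte first.
    static void writeIntLE(ByteArrayOutputStream out, long v) {
        for (int i = 0; i < 4; i++) {
            out.write((int) (v >>> (8 * i)) & 0xff);
        }
    }

    // rawDeflate must be a complete, terminated raw deflate stream of original.
    static byte[] wrap(byte[] rawDeflate, byte[] original) {
        CRC32 crc = new CRC32();
        crc.update(original, 0, original.length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER, 0, HEADER.length);
        out.write(rawDeflate, 0, rawDeflate.length);
        writeIntLE(out, crc.getValue());   // CRC-32 of the uncompressed data
        writeIntLE(out, original.length);  // uncompressed length (fits in 32 bits here)
        return out.toByteArray();
    }

    // Single-threaded raw deflate (nowrap == true), only used to feed wrap().
    static byte[] rawDeflate(byte[] input) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(input);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        def.end();
        return out.toByteArray();
    }

    // Decode with GZIPInputStream, which verifies the CRC and length trailer.
    static byte[] gunzip(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] text = "Hello world! How are you?".getBytes(StandardCharsets.US_ASCII);
        byte[] gz = wrap(rawDeflate(text), text);
        // true when the wrapper is accepted and the original bytes come back
        System.out.println(Arrays.equals(text, gunzip(gz)));
    }
}
```

For the multithreaded case, the CRC32 object would be updated with each uncompressed block in order on one thread, while the deflate work happens elsewhere.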

Note that I said a complete, terminated deflate stream. It takes some care to make one of those with parallel deflate tasks. You would need to make n-1 unterminated deflate streams and one final terminated deflate stream and concatenate those. The last one is made normally. The other n-1 need to be terminated using sync flush in order to end each on a byte boundary and to not mark it as the end of the stream. To do that, you use deflate with the flush parameter SYNC_FLUSH. Don't use finish() on those.
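The n-1 sync-flushed pieces plus one finished piece could be produced along these lines. This is only a sketch: ParallelDeflate, deflateChunk, deflateParallel and inflateRaw are hypothetical names, a parallel IntStream stands in for whatever thread pool you actually use, and all data is assumed to fit in memory. The result is a single raw deflate stream, checked here with a nowrap Inflater.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ParallelDeflate {
    // Deflate data[off, off + len) as raw deflate. Only the last chunk is
    // finished; the others end with a sync flush on a byte boundary.
    static byte[] deflateChunk(byte[] data, int off, int len, boolean last) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(data, off, len);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        if (last) {
            def.finish();
            while (!def.finished()) {
                out.write(buf, 0, def.deflate(buf));
            }
        } else {
            int n;
            do {
                n = def.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
                out.write(buf, 0, n);
            } while (n == buf.length); // full buffer: flushed output may remain
        }
        def.end();
        return out.toByteArray();
    }

    // Compress fixed-size chunks in parallel, then concatenate them in order.
    // Assumes chunkSize > 0.
    static byte[] deflateParallel(byte[] data, int chunkSize) {
        int chunks = Math.max(1, (data.length + chunkSize - 1) / chunkSize);
        List<byte[]> parts = IntStream.range(0, chunks).parallel()
            .mapToObj(i -> {
                int off = i * chunkSize;
                int len = Math.min(chunkSize, data.length - off);
                return deflateChunk(data, off, len, i == chunks - 1);
            })
            .collect(Collectors.toList()); // encounter order is preserved
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) {
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    // Inflate a raw deflate stream, used only to check the result.
    static byte[] inflateRaw(byte[] stream) {
        Inflater inf = new Inflater(true);
        // The Inflater Javadoc asks for one extra dummy byte with nowrap.
        inf.setInput(Arrays.copyOf(stream, stream.length + 1));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && !inf.finished()
                        && (inf.needsInput() || inf.needsDictionary())) {
                    throw new IllegalStateException("incomplete deflate stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[200_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) ('a' + (i / 3) % 17);
        }
        byte[] stream = deflateParallel(data, 32 * 1024);
        System.out.println(Arrays.equals(data, inflateRaw(stream)));
    }
}
```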

For better compression, you can use setDictionary on each chunk with the last 32K of the previous chunk.
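A sketch of the dictionary idea, assuming all the uncompressed data sits in one array so each chunk can see the bytes before it. DictChunks and its methods are hypothetical names, and the chunks are compressed one after another here to keep the example short; each deflateChunk call is independent, so the calls could just as well run on separate threads. No dictionary is needed when inflating, since the decompressor already has the previous 32K in its window.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DictChunks {
    static final int WINDOW = 32 * 1024;

    // Raw-deflate data[off, off + len), optionally primed with up to 32K of
    // the bytes that immediately precede the chunk.
    static byte[] deflateChunk(byte[] data, int off, int len,
                               boolean last, boolean useDict) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        int dictLen = Math.min(WINDOW, off);
        if (useDict && dictLen > 0) {
            def.setDictionary(data, off - dictLen, dictLen);
        }
        def.setInput(data, off, len);
        if (last) {
            def.finish(); // only the final chunk terminates the stream
        }
        int flush = last ? Deflater.NO_FLUSH : Deflater.SYNC_FLUSH;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (true) {
            int n = def.deflate(buf, 0, buf.length, flush);
            out.write(buf, 0, n);
            if (last ? def.finished() : n < buf.length) {
                break;
            }
        }
        def.end();
        return out.toByteArray();
    }

    // Compress every chunk and concatenate in order. Assumes chunkSize > 0.
    static byte[] deflateAll(byte[] data, int chunkSize, boolean useDict) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int off = 0;
        do {
            int len = Math.min(chunkSize, data.length - off);
            boolean last = off + len == data.length;
            byte[] part = deflateChunk(data, off, len, last, useDict);
            out.write(part, 0, part.length);
            off += len;
        } while (off < data.length);
        return out.toByteArray();
    }

    // Plain raw inflate; no setDictionary call is needed on this side.
    static byte[] inflateRaw(byte[] stream) {
        Inflater inf = new Inflater(true);
        // The Inflater Javadoc asks for one extra dummy byte with nowrap.
        inf.setInput(Arrays.copyOf(stream, stream.length + 1));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && !inf.finished()
                        && (inf.needsInput() || inf.needsDictionary())) {
                    throw new IllegalStateException("incomplete deflate stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("record ").append(i).append(": Hello world! How are you?\n");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] plain = deflateAll(data, 64 * 1024, false);
        byte[] primed = deflateAll(data, 64 * 1024, true);
        System.out.println("without dictionary: " + plain.length + " bytes");
        System.out.println("with dictionary:    " + primed.length + " bytes");
        System.out.println(Arrays.equals(data, inflateRaw(primed)));
    }
}
```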
