I am new to Hadoop and trying to process the Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but since gzip is not splittable, the whole file can only be processed by a single mapper. This seems to put a limitation on the processing. Is there an alternative, such as decompressing the XML file, splitting it into multiple chunks, and recompressing each chunk with gzip? Something like the rough sketch below is what I have in mind.
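To make the idea concrete, here is a rough sketch of the split-and-recompress step (file names and the chunk size are just placeholders, and it splits at arbitrary byte offsets, so an XML record could end up straddling two chunks):

    import gzip

    # Rough sketch: read the decompressed dump in fixed-size chunks and
    # re-compress each chunk as its own gzip file, so Hadoop can assign
    # one mapper per chunk file instead of one mapper for the whole dump.
    CHUNK_SIZE = 512 * 1024 * 1024  # 512 MB of uncompressed data per chunk (a guess)

    with gzip.open("enwiki-pages-articles.xml.gz", "rb") as src:
        part = 0
        while True:
            data = src.read(CHUNK_SIZE)
            if not data:
                break
            with gzip.open("chunk-%05d.xml.gz" % part, "wb") as dst:
                dst.write(data)
            part += 1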
I read about Hadoop's handling of gzip files at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for your help.