3 votes

I have seen many posts about writing Hadoop MapReduce output in gzip or other compressed formats. However, I don't see much about how Hadoop Streaming reads in (inputs) compressed formats. I found an older post about using -jobconf stream.recordreader.compression=gzip http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200907.mbox/%[email protected]%3E to handle the input side. Currently, I am using Cloudera CDH 5 on Ubuntu 12.04 LTS, writing the mapper and reducer in Python.

1
Have you tried anything? Streaming should handle gzip'd files automatically. - libjack

1 Answer

5 votes

No additional command-line arguments are needed: gzip input is natively supported by Hadoop Streaming jobs. Gzip files are automatically detected and decompressed; just pass them with the -input option. Here is a very simple example:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/johndoe/test_input.gz -output /user/johndoe/output -mapper /bin/cat -reducer /usr/bin/wc

In terms of input, using a Python mapper and reducer will not change anything.
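To illustrate, here is a minimal word-count mapper/reducer sketch (the word-count task, file name, and function names are my own illustration, not from the question). Because Hadoop Streaming decompresses the .gz files before piping them to the mapper, this Python code is identical whether the input is gzipped or plain text:

```python
import sys

# Illustrative word-count sketch for Hadoop Streaming. The mapper reads
# plain text lines on stdin -- already decompressed by Hadoop if the
# -input files were gzipped -- and the reducer sums per-word counts.

def map_line(line):
    """Mapper step: emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Reducer step: sum the counts for each key. Hadoop's shuffle delivers
    pairs grouped (sorted) by key; a dict is used here for simplicity."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

if __name__ == "__main__":
    # Run as "wordcount.py map" for the mapper, "wordcount.py reduce"
    # for the reducer (script/file names are assumptions).
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        for line in sys.stdin:
            for word, count in map_line(line):
                print("%s\t%d" % (word, count))
    else:
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        totals = reduce_pairs((k, int(v)) for k, v in pairs)
        for word, total in sorted(totals.items()):
            print("%s\t%d" % (word, total))
```

You would ship the script with -file and pass -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" alongside the gzipped -input paths, exactly as with uncompressed input.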

One caveat I've noticed but have yet to resolve: using gzip input with the -inputreader "StreamXmlRecordReader,begin=page,end=/page" option produces no output.