3 votes

I have seen many posts about writing Hadoop MapReduce output in gzip or other compressed formats. However, I don't see much about how Hadoop Streaming reads in (inputs) compressed formats. I found an older post about using -jobconf stream.recordreader.compression=gzip http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200907.mbox/%[email protected]%3E to handle the input side. Currently, I am using Cloudera CDH 5 on Ubuntu 12.04 LTS, writing the mapper and reducer in Python.

1
Have you tried anything? Streaming should handle gzip'd files automatically. - libjack

1 Answer

5 votes

No additional command-line arguments are needed: gzip input is natively supported by Hadoop Streaming jobs. Gzip files are automatically detected and decompressed; just pass them with the -input option. Here is a very simple example:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/johndoe/test_input.gz -output /user/johndoe/output -mapper /bin/cat -reducer /usr/bin/wc

In terms of input, using a Python mapper and reducer will not change anything.
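To illustrate, here is a minimal word-count mapper/reducer sketch (the word-count task, file name, and function names are my own illustration, not from the question). Because Hadoop Streaming decompresses the .gz files before piping them to the mapper, this Python code is identical whether the input is gzipped or plain text:

```python
import sys

# Illustrative word-count sketch for Hadoop Streaming. The mapper reads
# plain text lines on stdin -- already decompressed by Hadoop if the
# -input files were gzipped -- and the reducer sums per-word counts.

def map_line(line):
    """Mapper step: emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    """Reducer step: sum the counts for each key. Hadoop's shuffle delivers
    pairs grouped (sorted) by key; a dict is used here for simplicity."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

if __name__ == "__main__":
    # Run as "wordcount.py map" for the mapper, "wordcount.py reduce"
    # for the reducer (script/file names are assumptions).
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        for line in sys.stdin:
            for word, count in map_line(line):
                print("%s\t%d" % (word, count))
    else:
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        totals = reduce_pairs((k, int(v)) for k, v in pairs)
        for word, total in sorted(totals.items()):
            print("%s\t%d" % (word, total))
```

You would ship the script with -file and pass -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" alongside the gzipped -input paths, exactly as with uncompressed input.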

One caveat I've noticed but have yet to resolve: using gzip input with the -inputreader "StreamXmlRecordReader,begin=page,end=/page" option produces no output.