How to control the number of hadoop streaming output files

Question

Here are the details:

The input files are in the HDFS path /user/rd/input, and the HDFS output path is /user/rd/output. The input path contains 20,000 files, part-00000 through part-19999, each about 64 MB. I want to write a Hadoop streaming job that merges these 20,000 files into 10,000 files.

Is there a way to merge these 20,000 files into 10,000 files using a Hadoop streaming job? In other words, is there a way to control the number of output files a Hadoop streaming job produces?

Thanks in advance!

Donald Miner · Accepted Answer · 2013-10-11T14:54:19

It looks like right now you have a map-only streaming job. A map-only job writes one output file per map task, and there isn't much you can do to change that behavior.

You can exploit the way MapReduce works by adding a reduce phase with 10,000 reducers. Each reducer writes one output file, so you are left with 10,000 files. Note that your data records will be "scattered" across the 10,000 files; each output file won't simply be two input files concatenated. To do this, pass the -D mapred.reduce.tasks=10000 flag in your command-line arguments.

This is probably the default behavior, but you can also specify the identity reducer explicitly as your reducer. It does nothing other than pass each record through, which is what I think you want here. To do so, use this flag: -reducer org.apache.hadoop.mapred.lib.IdentityReducer
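
Putting the two flags together, the invocation might look roughly like the sketch below. The streaming jar location and the /bin/cat pass-through mapper are assumptions that do not come from the question, so adjust them to your installation and job. The snippet only assembles and prints the command so you can review it before running it.

```shell
# Hypothetical jar location: replace with the streaming jar of your install.
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"

# One output file is written per reducer, so this sets the file count.
NUM_REDUCERS=10000

# Generic options such as -D must come before the streaming options.
CMD="hadoop jar $STREAMING_JAR"
CMD="$CMD -D mapred.reduce.tasks=$NUM_REDUCERS"
CMD="$CMD -input /user/rd/input"
CMD="$CMD -output /user/rd/output"
# Assumed pass-through mapper; substitute your own mapper if you have one.
CMD="$CMD -mapper /bin/cat"
CMD="$CMD -reducer org.apache.hadoop.mapred.lib.IdentityReducer"

# Print the assembled command for review instead of executing it.
echo "$CMD"
```

The output directory must not already exist when the job starts. On newer Hadoop releases the same setting is named mapreduce.job.reduces, and mapred.reduce.tasks is its deprecated alias.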
