Hadoop job output files

Question

I currently have a Hadoop Oozie job running. The output files are generated automatically. I expect just ONE output file; however, there are two, called part-r-00000 and part-r-00001. Sometimes the first one (part-r-00000) has data and the second one (part-r-00001) doesn't. Sometimes the second one has data and the first one doesn't. Can anyone tell me why? Also, how can I make the job write only part-r-00000?

Tgsmith61591 · Accepted Answer · 2016-02-26T19:26:26

In Hadoop, the output files are written by the Reducers (or by the Mappers if it's a map-only job, in which case the files are named part-m-xxxxx). If your job uses two reducers, each one writes its own file to the output directory once it has finished its portion. The files are named part-r-xxxxx, where the number identifies the reducer that wrote the file.

That said, you cannot specify a single output file, only the output directory. To merge all of the files in the output directory into a single local file, use:

hdfs dfs -getmerge <src> <localdst> [addnl]

Or, if you're using an older version of Hadoop:

hadoop fs -getmerge <src> <localdst> [addnl]

See the shell guide for more info.

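As to why one of your output files is empty: each key is routed from the Mappers to a Reducer by the job's partitioner (by default, a hash of the key modulo the number of reducers), and the grouping comparator then decides which keys are handled together in a single reduce call. If you specify two reducers but all of the keys land in one partition (for example, because there is only one group), the other reducer receives no data and writes an empty file. Which reducer gets the data depends on the key, which would explain why the non-empty file varies between runs. Alternatively, if some logic within the reducer prevents the write, that is another reason a reducer may produce no data.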
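To make the routing concrete, here is a minimal Java sketch of the arithmetic used by Hadoop's default hash partitioning: the key's hash code with the sign bit cleared, modulo the number of reducers. It is a standalone illustration, not Hadoop code: the class and method names (PartitionSketch, partitionFor, outputFileFor) are made up for this example, and plain String.hashCode() stands in for the key's hash. Real Writable key types such as Text compute their hash differently, so the partition numbers in an actual job will generally not match the ones here.

```java
import java.util.Locale;

// Illustrative stand-in for default hash partitioning; hypothetical names, no Hadoop dependency.
class PartitionSketch {

    // Clear the sign bit of the hash, then take the remainder by the reducer count.
    // Assumption: String.hashCode() is used here in place of a Writable key's hashCode().
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Name of the file written by the reducer with the given index.
    static String outputFileFor(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        // A job whose data contains only one of these keys feeds only one reducer.
        for (String key : new String[] {"a", "b"}) {
            int partition = partitionFor(key, numReduceTasks);
            System.out.println(key + " -> " + outputFileFor(partition));
        }
    }
}
```

In this sketch the key "a" maps to part-r-00001 and the key "b" maps to part-r-00000, so with two reducers a run that only ever sees one of those keys leaves the other file empty. With a single reducer every key maps to partition 0.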
