I currently have one hadoop oozie job running. The output files are automatically generated. The expected number of output files is just ONE; however, there are two output files called part-r-00000 and part-r-00001. Sometimes, the first one(part-r-00000) has data, and the second one (part-r-00001) doesn't. Sometimes, the second one has, and the first one doesn't. Can anyone tell me why? Also, How to set the output file to part-r-00000?
3 Answers
In Hadoop, the output files are a product of the Reducers (or Mappers if it's a map-side only job, in which case it will be a part-m-xxxxx
file). If your job uses two reducers, that means that after each has finished with its portion, it will write to the output directory in the form of part-r-xxxxx
, where the numbers denote which reducer wrote it out.
That said, you cannot specify a single output file, but only the directory. To get all of the files from the output directory into a single file, use:
hdfs dfs -getmerge <src> <localdst> [addnl]
Or if you're using an older version of hadoop:
hadoop fs -getmerge <src> <localdst> [addnl]
See the shell guide for more info.
As to why one of your output files is empty, data is passed from Mappers to Reducers based on the grouping comparator. If you specify two reducers, but there is only one group (as identified by the grouping comparator), data will not be written from one reducer. Alternatively, if some logic within the reducer prevents a writing operation, that's another reason data may not be written from one reducer.
The output files are by default named part-x-yyyyy where:
- x is either 'm' or 'r', depending on whether this file was generated by a map or reduce task
- yyyyy is the mapper or reducer task number (zero based)
The number of tasks has nothing to do with the number of physical nodes in the cluster. For map task output the number of tasks is given by the input splits. Usually the reducer task are set with job.setNumReduceTasks()
or passed as input parameter.
A job which has 100 reducers will have files named part-r-00000 to part-r-00100, one for each reducer task. A map only job with 100 input splits will have files named part-m-00000 to part-m-00100, one for each reducer task.
The number of files output is dependent on the number of mappers and reducers. In your case, the number of files and names of files indicates that your output came from 2 reducers.
To limit the number of mappers or reducers is dependent on your language (Hive, Java, etc), but each has a property that you can set to limit these. See here for Java MapReduce jobs.
Files can be empty if that particular mapper or reducer task had no resulting data on the given data node.
Finally, I don't think you want to limit your mappers and reducers. This will defeat the point of using Hadoop. If you're aiming to read all files as one, make sure they are consolidated in a given directory and pass the directory as the file name. The files will be treated as one.