10 votes

Given a job with map and reduce phases, I can see that the output folder contains files named like "part-r-00000".

If I need to post-process these files at the application level, do I have to iterate over all files in the output folder in natural naming order (part-r-00000, part-r-00001, part-r-00002, ...) to get the job results?

Or can I use some Hadoop helper file reader that gives me an "iterator" and handles the file switching for me (when part-r-00000 is completely read, continue from part-r-00001)?


3 Answers

5 votes

In MapReduce you specify an output folder; the only things it will contain are part-r files (each one the output of a reduce task) and a _SUCCESS file (which is empty). So if you want to do post-processing, you only need to set the output dir of job 1 as the input dir for job 2.
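
A minimal sketch of that chaining (the paths, job names and configuration details below are illustrative, and the mapper/reducer setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Path intermediate = new Path("/tmp/job1-output");   // hypothetical intermediate dir

Job job1 = Job.getInstance(conf, "job1");            // mapper/reducer/jar setup omitted
FileInputFormat.addInputPath(job1, new Path("/data/input"));  // hypothetical input dir
FileOutputFormat.setOutputPath(job1, intermediate);
job1.waitForCompletion(true);

// The part-r files written by job1 become the input of the post-processing job.
Job job2 = Job.getInstance(conf, "postprocess");
FileInputFormat.addInputPath(job2, intermediate);
FileOutputFormat.setOutputPath(job2, new Path("/tmp/job2-output"));
job2.waitForCompletion(true);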

Now there might be some requirements for your post-processor that need to be addressed: is it, for example, important to process the output files in order?

Or, if you just want to process the files locally, it all depends on the output format of your MapReduce job; that will tell you how the part-r files are structured. Then you can simply use standard I/O, I guess.
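
For instance, if the job uses the default TextOutputFormat (tab-separated key/value lines), a plain-Java sketch of reading one local part file could look like this (file name and separator are assumptions, adjust them to your actual output format):

import java.io.BufferedReader;
import java.io.FileReader;

try (BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] kv = line.split("\t", 2);   // TextOutputFormat writes key<TAB>value
        // post-process kv[0] (key) and kv[1] (value) here
    }
}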

8 votes

You can use the getmerge command of the Hadoop FileSystem (FS) shell:

hadoop fs -getmerge /mapreduce/job/output/dir/ /your/local/output/file.txt
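
(getmerge concatenates the files in the source directory into the single local file; since the _SUCCESS marker is empty, it adds nothing to the merged output.)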

2 votes

You can probably use the Hadoop FileSystem API to iterate over the part-r-xxxxx files from your application.

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs://hostname:port/joboutputpath"));
for (int i = 0; i < status.length; i++) {
    // open each part file in turn
    FSDataInputStream in = fs.open(status[i].getPath());
    // ... process the stream ...
    in.close();
}
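
If you also want to skip the _SUCCESS marker, keep the natural part-r-00000, part-r-00001, ... order and read the records line by line, a variant of the above could look like this (the path and the tab-separated record assumption are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] parts = fs.listStatus(
        new Path("hdfs://hostname:port/joboutputpath"),
        p -> p.getName().startsWith("part-r-"));   // keep only reduce output files
Arrays.sort(parts);                                // natural naming order
for (FileStatus part : parts) {
    try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(part.getPath())))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // handle one output record; with TextOutputFormat this is key<TAB>value
        }
    }
}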

You can also look into ChainMapper/ChainReducer.