10 votes

Given a job with map and reduce phases, I can see that the output folder contains files named like "part-r-00000".

If I need to post-process these files at the application level, do I have to iterate over all files in the output folder in natural naming order (part-r-00000, part-r-00001, part-r-00002, ...) to get the job results?

Or can I use some Hadoop helper file reader that gives me an "iterator" and handles the file switching for me (when part-r-00000 is completely read, continue from part-r-00001)?


3 Answers

5 votes

In MapReduce you specify an output folder; the only things it will contain are part-r files (each one the output of a reduce task) and a _SUCCESS file (which is empty). So if you want to do post-processing, you only need to set the output dir of job 1 as the input dir for job 2.
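
A minimal sketch of that chaining (the paths, job names and configuration details below are illustrative, and the mapper/reducer setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Path intermediate = new Path("/tmp/job1-output");   // hypothetical intermediate dir

Job job1 = Job.getInstance(conf, "job1");            // mapper/reducer/jar setup omitted
FileInputFormat.addInputPath(job1, new Path("/data/input"));  // hypothetical input dir
FileOutputFormat.setOutputPath(job1, intermediate);
job1.waitForCompletion(true);

// The part-r files written by job1 become the input of the post-processing job.
Job job2 = Job.getInstance(conf, "postprocess");
FileInputFormat.addInputPath(job2, intermediate);
FileOutputFormat.setOutputPath(job2, new Path("/tmp/job2-output"));
job2.waitForCompletion(true);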

Now there might be some requirements for your post-processor that need to be addressed: is it, for example, important to process the output files in order?

Or, if you just want to process the files locally, it all depends on the output format of your MapReduce job; that will tell you how the part-r files are structured. Then you can simply use standard I/O, I guess.
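
For instance, if the job uses the default TextOutputFormat (tab-separated key/value lines), a plain-Java sketch of reading one local part file could look like this (file name and separator are assumptions, adjust them to your actual output format):

import java.io.BufferedReader;
import java.io.FileReader;

try (BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] kv = line.split("\t", 2);   // TextOutputFormat writes key<TAB>value
        // post-process kv[0] (key) and kv[1] (value) here
    }
}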

8 votes

You can use the getmerge command of the Hadoop FileSystem (FS) shell:

hadoop fs -getmerge /mapreduce/job/output/dir/ /your/local/output/file.txt
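
(getmerge concatenates the files in the source directory into the single local file; since the _SUCCESS marker is empty, it adds nothing to the merged output.)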

2 votes

You can probably use the Hadoop FileSystem API to iterate over the part-r-xxxxx files from your application.

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs://hostname:port/joboutputpath"));
for (int i = 0; i < status.length; i++) {
    // open each part file in turn
    FSDataInputStream in = fs.open(status[i].getPath());
    // ... process the stream ...
    in.close();
}
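
If you also want to skip the _SUCCESS marker, keep the natural part-r-00000, part-r-00001, ... order and read the records line by line, a variant of the above could look like this (the path and the tab-separated record assumption are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] parts = fs.listStatus(
        new Path("hdfs://hostname:port/joboutputpath"),
        p -> p.getName().startsWith("part-r-"));   // keep only reduce output files
Arrays.sort(parts);                                // natural naming order
for (FileStatus part : parts) {
    try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(part.getPath())))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // handle one output record; with TextOutputFormat this is key<TAB>value
        }
    }
}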

You can also look into ChainMapper/ChainReducer.