0 votes

I need to send only selected records from the mapper to the reducer, and write the rest (the filtered-out records) to HDFS from the mapper itself; the reducer will write only the records sent to it. My job processes huge data, around 20 TB, using 30K mappers, so I don't think I can write from the mapper's cleanup method either: loading the output files of 30K mappers (30K files) would be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario, perhaps with a different approach?

1
A very interesting question! (+1) I once had this problem and didn't find anything better than sending those records from the mapper to the reducer too, and writing everything from the reducer (after filtering which records need further processing). Of course, that was very inefficient compared to writing them straight from the mapper. - vefthym

1 Answer

0 votes

When you write the data to HDFS, is it through a Java client? If so, you can put conditional logic in the mapper: records meeting the filter condition are written directly to an HDFS location, while the remaining records are emitted as mapper output, from where the reducer picks them up. By default the job's output location is also on HDFS, so you have to decide where in HDFS you want each kind of data to land for your case.
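One common approach on CDH4-era MapReduce for this split is `MultipleOutputs`: the mapper writes filtered records to a named side output on HDFS and emits only the selected records via `context.write(...)` for the reducer. Below is a minimal, Hadoop-free sketch of just the routing logic (the class, method, and record format are hypothetical; the comments mark where the real Hadoop calls would go):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of mapper-side conditional routing. In a real CDH4 job, the
// "to reducer" branch would call context.write(key, value) inside
// Mapper.map(), and the "direct to HDFS" branch would call
// MultipleOutputs.write(...) to a named side output.
public class RecordRouter {
    // Hypothetical filter: only records the reducer should process.
    public static boolean needsReduce(String record) {
        return record.startsWith("KEEP");
    }

    public static void main(String[] args) {
        List<String> toReducer = new ArrayList<>();
        List<String> directToHdfs = new ArrayList<>();
        String[] input = {"KEEP,1", "DROP,2", "KEEP,3"};
        for (String record : input) {
            if (needsReduce(record)) {
                toReducer.add(record);    // context.write(...) in the real mapper
            } else {
                directToHdfs.add(record); // MultipleOutputs.write(...) in the real mapper
            }
        }
        System.out.println(toReducer.size() + " " + directToHdfs.size());
    }
}
```

This keeps the filtered records out of the shuffle entirely, and because `MultipleOutputs` writes go to files under the job's output directory rather than one file per mapper task attempt in a separate location, the downstream job can read them as a normal input path.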