Hadoop MapReduce: find max key-value pair from output of mapper

Question

It sounds like a simple job, but with MapReduce it doesn't seem that straightforward.

I have N files, each containing a single line of text. I'd like the Mapper to output key-value pairs of the form <filename, score>, where 'score' is an integer calculated from that line. As a side note, I am using the snippet below to get the file name (hope it's correct).

 FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
 String fileName = fileSplit.getPath().getName();

Assuming the mapper does its job correctly, it should output N key-value pairs. The problem is: how should I program the Reducer so that it outputs only the one key-value pair with the maximum 'score'?

From what I know, a Reducer only works with key-value pairs that share the same key. Since the pairs in this scenario all have different keys, I am guessing something should be done before the Reduce step. Or should the Reduce step perhaps be omitted altogether?

Nishant Nagwani · Accepted Answer · 2011-12-13T06:47:06

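You can use the setup() and cleanup() methods (configure() and close() in the old API). Declare a member variable in the reducer class that holds the maximum score seen so far, together with its key. On each call to reduce(), compare the input value (the score) with that variable and update it if the new score is larger. Then write the single winning pair from cleanup() rather than from reduce().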

setup() is called once before all reduce invocations in a reduce task, and cleanup() is called once after the last reduce invocation in that task. If you have multiple reducers, setup() and cleanup() are called separately in each reduce task, so each task ends up with its own maximum. To get one overall maximum, run the job with a single reduce task.
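The pattern can be sketched roughly as follows. This is a plain-Java stand-in with no Hadoop dependency, so it can be run on its own. The class name MaxScoreReducerSketch and its method signatures are hypothetical and only mirror the setup/reduce/cleanup lifecycle. In a real job these would presumably be overrides in a subclass of org.apache.hadoop.mapreduce.Reducer, and cleanup() would write the pair through the Context instead of returning it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

// Plain-Java stand-in for a Hadoop reducer (hypothetical names); it only
// mirrors the setup/reduce/cleanup lifecycle of a single reduce task.
class MaxScoreReducerSketch {
    private String maxFile;   // file name with the highest score seen so far
    private int maxScore;     // that highest score
    private boolean seenAny;  // false until the first value arrives

    // Called once before any reduce() call in this task.
    void setup() {
        maxFile = null;
        maxScore = Integer.MIN_VALUE;
        seenAny = false;
    }

    // Called once per key; updates the running maximum and emits nothing.
    void reduce(String fileName, Iterable<Integer> scores) {
        for (int score : scores) {
            if (!seenAny || score > maxScore) {
                maxFile = fileName;
                maxScore = score;
                seenAny = true;
            }
        }
    }

    // Called once after the last reduce() call; returns the single winner,
    // or null if the task saw no input. A real reducer would write it out here.
    Map.Entry<String, Integer> cleanup() {
        return seenAny ? new SimpleEntry<>(maxFile, maxScore) : null;
    }

    public static void main(String[] args) {
        MaxScoreReducerSketch reducer = new MaxScoreReducerSketch();
        reducer.setup();
        reducer.reduce("a.txt", Arrays.asList(3));
        reducer.reduce("b.txt", Arrays.asList(9));
        reducer.reduce("c.txt", Arrays.asList(5));
        System.out.println(reducer.cleanup());
    }
}
```

Because each reduce task keeps its own running maximum, this yields one global result only when the job runs with a single reduce task (for example via setNumReduceTasks(1) on the job). On ties, the sketch keeps the first file it saw.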
