I am new to Hadoop MapReduce. I need to find the name of the student with the highest total mark. Consider this sample dataset:
Harry Maths 80
Harry Physics 67
Daisy Science 89
Daisy Physics 90
Greg Maths 70
Greg Chemistry 79
I know that the reducer is called once for each unique key, so I will get three output key-value pairs, each with a student's name and total marks. But I need only the name of the student with the highest total mark, i.e. reducer output -> Daisy 179
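To make the expected result concrete, here is a minimal plain-Java sketch of the computation I am after, with no Hadoop involved. The class name MaxTotalSketch and the method topStudent are made up for illustration, and it assumes every record has the form "name subject mark" separated by whitespace.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, not a Hadoop job.
public class MaxTotalSketch {

    // Returns "name total" for the student with the highest summed mark.
    // Assumes each record is "name subject mark", whitespace-separated.
    public static String topStudent(List<String> records) {
        // Step 1: sum the marks per student (what the reducer does per key).
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String record : records) {
            String[] fields = record.trim().split("\\s+");
            totals.merge(fields[0], Integer.parseInt(fields[2]), Integer::sum);
        }

        // Step 2: keep only the highest total and report it once, at the end.
        String bestName = null;
        int bestTotal = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            if (entry.getValue() > bestTotal) {
                bestTotal = entry.getValue();
                bestName = entry.getKey();
            }
        }
        return bestName + " " + bestTotal;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "Harry Maths 80", "Harry Physics 67",
                "Daisy Science 89", "Daisy Physics 90",
                "Greg Maths 70", "Greg Chemistry 79");
        System.out.println(topStudent(records)); // prints "Daisy 179"
    }
}
```

The per-student totals here are Harry 147, Daisy 179 and Greg 149, so the single line I want from the job is Daisy 179.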
This is the reduce function I have written:
    static int maxMark = 0;
    static Text name = new Text();

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Sum all marks for the current student (key)
            int totalMarks = 0;
            while (values.hasNext()) {
                totalMarks += values.next().get();
            }
            // Emit whenever this student's total beats the running maximum
            if (totalMarks > maxMark) {
                maxMark = totalMarks;
                name = key;
                output.collect(name, new IntWritable(maxMark));
            }
        }
    }
But this logic also outputs the previously saved students' names and marks, because it emits a pair every time a new maximum is found. I could solve this if I knew the number of input keys to the reducer before the reducer is first called, so that when the reducer reaches the last key (name), I can call output.collect(name, new IntWritable(maxMark)); exactly once.
So, is there a way to find the number of input keys to the reducer? If not, what are the alternatives for getting one single output from the reducer?