I am new to Hadoop MapReduce. I need to find the name of the student with the highest total mark. Consider this sample dataset:

Harry Maths 80

Harry Physics 67

Daisy Science 89

Daisy Physics 90

Greg Maths 70

Greg Chemistry 79

I know that the reducer is called once for each unique key, so I am going to get 3 output key-value pairs, each with a name and that student's total marks. But I need only the name of the student with the highest total, i.e. reducer output -> Daisy 179

Following is the reduce function I have written:

    // Highest total seen so far and the student it belongs to
    static int maxMark = 0;
    static Text name = new Text();

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum all marks for the current student
            int totalMarks = 0;
            while (values.hasNext()) {
                totalMarks += values.next().get();
            }
            if (totalMarks > maxMark) {
                maxMark = totalMarks;
                name.set(key); // copy the key, since Hadoop reuses the key object
                output.collect(name, new IntWritable(maxMark));
            }
        }
    }

But this logic also outputs every earlier student whose total was the running maximum at the time, not just the final winner. I could solve this if I knew the number of input keys before the reducer is called: then, when the reducer reaches the last key (name), I could call output.collect(name, new IntWritable(maxMark)); exactly once.

So, is there a way to find the number of input keys to the reducer? If not, what other ways are there to get a single output record from the reducer?

1 Answer

You need two MapReduce jobs. The first totals the marks by name, irrespective of subject. The second uses a mapper that swaps the keys and values, so that the key is the sum of marks from the first job, and sorts those keys with a descending comparator. Configure this second job to use only a single reducer task; the reducer can set a flag after its first call to reduce and ignore every later call, so only the highest total is written out.
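
To make the data flow concrete, here is a minimal plain-Java sketch that simulates the two jobs in memory on the sample data from the question. It deliberately does not use the Hadoop API: the class name TopStudentSketch and the methods totalByName and topStudent are made up for illustration, and the explicit sort stands in for what the shuffle would do with a descending comparator in a real job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the two-job data flow (hypothetical names, no Hadoop classes).
class TopStudentSketch {

    // Job 1 (simulated): group records by student name and sum the marks.
    static Map<String, Integer> totalByName(List<String> records) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String record : records) {
            String[] fields = record.trim().split("\\s+"); // name, subject, mark
            totals.merge(fields[0], Integer.parseInt(fields[2]), Integer::sum);
        }
        return totals;
    }

    // Job 2 (simulated): swap each pair to (total, name), sort by total in
    // descending order, and let a single "reducer" keep only the first pair.
    // Returns null when there are no totals at all.
    static Map.Entry<String, Integer> topStudent(Map<String, Integer> totals) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            swapped.add(new SimpleEntry<>(e.getValue(), e.getKey()));
        }
        // Stands in for the shuffle sorting keys with a descending comparator.
        swapped.sort((a, b) -> Integer.compare(b.getKey(), a.getKey()));

        boolean emitted = false; // the flag that ignores all but the first call
        Map.Entry<String, Integer> result = null;
        for (Map.Entry<Integer, String> e : swapped) {
            if (emitted) {
                continue; // later "reduce calls" are ignored
            }
            result = new SimpleEntry<>(e.getValue(), e.getKey());
            emitted = true;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "Harry Maths 80", "Harry Physics 67",
                "Daisy Science 89", "Daisy Physics 90",
                "Greg Maths 70", "Greg Chemistry 79");
        Map.Entry<String, Integer> top = topStudent(totalByName(records));
        System.out.println(top.getKey() + " " + top.getValue());
    }
}
```

On the sample dataset the intermediate totals are Harry 147, Daisy 179 and Greg 149, so main prints Daisy 179. If two students tie for the highest total, this sketch keeps whichever appeared first in the input; a real job would need its own tie-breaking rule.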