Hadoop word count: get the word that occurs most often

Question

I am very new to Hadoop. I have finished the word-count example and now I want to modify it.

I want to get the word that occurs most often in a text file. If the normal word-count program gives this output:

a 1
b 4
c 2

I want to write a program that outputs only:

b 4

Here is my reducer:

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> 
{

 int max_sum=0;
 Text max_occured_key;

 public void reduce(Text key, Iterable<IntWritable> values, Context context) 
  throws IOException, InterruptedException 
  {
    int sum = 0;
    for (IntWritable val : values) 
    {
        sum += val.get();           
    }
    if(sum > max_sum)
    {
        max_sum = sum;
        max_occured_key = key;

    }

    context.write(max_occured_key, new IntWritable(max_sum));
    //context.write(key, new IntWritable(sum));

  }

}

But it is not giving the right output. Can anyone help, please?

Chris White · Accepted Answer · 2013-01-14T12:17:13

You're writing out the maximum value so far at the end of each reduce call, so you'll get more than a single entry per reducer. You're also running into reference re-use problems, because you're copying the reference to the key into your max_occured_key variable rather than copying its value.

You should probably amend as follows:

Initialize the max_occured_key variable at construction time (to an empty Text).
Call max_occured_key.set(key); rather than using plain assignment. The object passed as the key parameter is reused for every call to the reduce method, so the reference stays the same and only the underlying contents change from call to call.
Override the cleanup method and move the context.write call to that method, so that you'll only get one K,V output pair per reducer.

For example:

@Override
protected void cleanup(Context context)
    throws IOException, InterruptedException {
  // Runs once per reducer task, after the last reduce call
  context.write(max_occured_key, new IntWritable(max_sum));
}

The cleanup method is called once all the data has been passed through your map or reduce task, and it is called per task instance (so if you have 10 reducers, this method will be called once for each instance).
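To see why both changes matter, here is a rough plain-Java sketch that mimics the reducer life cycle without any Hadoop classes. MutableKey, MaxReducer and run are hypothetical names invented for this illustration: MutableKey stands in for Text, and run plays the role of the framework by reusing a single key object across all reduce calls. It is only a simulation under those assumptions, not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the reducer life cycle; no Hadoop classes are used.
class MaxWordSketch {

    // Hypothetical mutable key, mimicking how one Text object gets reused.
    static final class MutableKey {
        private String value = "";
        void set(String v) { value = v; }
        void set(MutableKey other) { value = other.value; }
        @Override public String toString() { return value; }
    }

    // Tracks the maximum across reduce calls and emits once in cleanup.
    static final class MaxReducer {
        private int maxSum = 0;
        private MutableKey maxKey = new MutableKey(); // initialised up front
        private final boolean copyValue;
        final List<String> output = new ArrayList<>();

        MaxReducer(boolean copyValue) { this.copyValue = copyValue; }

        void reduce(MutableKey key, List<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            if (sum > maxSum) {
                maxSum = sum;
                if (copyValue) {
                    maxKey.set(key); // copy the contents (the fix)
                } else {
                    maxKey = key;    // keep the shared reference (the bug)
                }
            }
        }

        // Called once, after every key has been reduced.
        void cleanup() {
            output.add(maxKey + " " + maxSum);
        }
    }

    // Plays the framework's role: one key object is reused for every group.
    static List<String> run(Map<String, List<Integer>> grouped, boolean copyValue) {
        MaxReducer reducer = new MaxReducer(copyValue);
        MutableKey shared = new MutableKey();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            shared.set(e.getKey());
            reducer.reduce(shared, e.getValue());
        }
        reducer.cleanup();
        return reducer.output;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("a", Arrays.asList(1));
        grouped.put("b", Arrays.asList(1, 1, 1, 1));
        grouped.put("c", Arrays.asList(1, 1));
        System.out.println(run(grouped, true));  // prints [b 4]
        System.out.println(run(grouped, false)); // prints [c 4]
    }
}
```

With the value copied, the single record emitted from cleanup is b 4. With plain reference assignment, the stored key silently follows the shared object and ends up as the last key seen (c), even though the count is still 4.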
