2 votes

From the Hadoop tutorial website (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code) on how to implement word count using a MapReduce approach, I understand how it works and that the output will be all words with their frequencies.

What I want is for the output to be only the highest-frequency word from my input file.

Example: Jim Jim Jim Jim Tom Dane

I want the output just to be

Jim 4

The current output from WordCount is each word and its frequency. Has anyone edited WordCount so that it just prints the highest-frequency word and its frequency?

Does anyone have any tips on how to achieve this?

How would I write another MapReduce job that finds the highest-frequency word from the output of WordCount?

Or is there another way?

Any help would be much appreciated.

Thank you!

WordCount.java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3 Answers

3 votes

A possible way is to set the number of reducers to 1. Then make the reducer remember the word with the highest frequency and write it to the output in cleanup(), like this:

public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private Text tmpWord = new Text("");
    private int tmpFrequency = 0;

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      if (sum > tmpFrequency) {
         tmpFrequency = sum;
         // Hadoop reuses the key object across reduce() calls,
         // so copy its contents instead of keeping a reference
         tmpWord.set(key);
      }
    }

    @Override
    public void cleanup(Context context)
        throws IOException, InterruptedException {
      // write the word with the highest frequency
      context.write(tmpWord, new IntWritable(tmpFrequency));
    }
}
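
For this to work the job must run with exactly one reduce task; otherwise each reducer only emits its own local maximum. A minimal sketch of the driver change (a fragment against the question's main(), not a complete program):

```java
// In main(), before submitting the job: force a single reduce task
// so one reducer sees every word and can pick the global maximum.
job.setNumReduceTasks(1);

// Caution: this max-picking class must NOT also be the combiner.
// A combiner that emits only one pair per map task would lose counts.
// Either remove job.setCombinerClass(...) or keep the original summing
// reducer as the combiner, since summing partial counts stays correct.
```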
0 votes

You won't be able to do this in one step: the reduce phase runs independently for every key, so reducers cannot synchronize with each other. The solution is to run a second MapReduce job that aggregates the output of your original WordCount job under a single key and then just selects the max. GL!
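
In that second job, the mapper would emit every (word, count) pair under one constant key, so a single reducer receives all pairs and keeps the largest. The selection step itself boils down to the loop below; this is a plain-Java sketch (the class and method names are made up for illustration), assuming WordCount's default tab-separated `word<TAB>count` output lines:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

public class MaxFrequency {

    // Scan WordCount output lines ("word<TAB>count") and keep the
    // pair with the highest count -- the same comparison the single
    // reducer of the second job would run over its values.
    public static Map.Entry<String, Integer> selectMax(List<String> lines) {
        String maxWord = null;
        int maxCount = Integer.MIN_VALUE;
        for (String line : lines) {
            String[] parts = line.split("\t");
            int count = Integer.parseInt(parts[1]);
            if (count > maxCount) {
                maxCount = count;
                maxWord = parts[0];
            }
        }
        return new SimpleEntry<>(maxWord, maxCount);
    }
}
```

In the real job this comparison lives in the reducer, with the mapper emitting every line under one constant key (and the number of reduce tasks set to 1) so the comparison sees the whole dataset.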

0 votes

If you force the job to run with only one reduce task, you can search for the highest frequency across all keys with a loop inside the reducer.

At the end, the loop holds the key with the highest frequency, and you send that single pair to the final output (the context.write() call should be executed only once, at the end).