2 votes

From the Hadoop tutorial website (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code) on how to implement word count using a MapReduce approach, I understand how it works and that the output will be all words with their frequencies.

What I want is for the output to be only the highest-frequency word from my input file.

Example: Jim Jim Jim Jim Tom Dane

I want the output just to be

Jim 4

The current output from WordCount is each word and its frequency. Has anyone edited WordCount so that it just prints the highest-frequency word and its frequency?

Does anyone have any tips on how to achieve this?

How would I write another MapReduce job that finds the highest-frequency word from the output of WordCount?

Or is there another way?

Any help would be much appreciated.

Thank you!

WordCount.java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3 Answers

3 votes

A possible way is to set the number of reducers to 1. Then make the reducer remember the word with the highest frequency and write it to the output in cleanup(), like this:

public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private Text tmpWord = new Text("");
    private int tmpFrequency = 0;

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      if (sum > tmpFrequency) {
         tmpFrequency = sum;
         // Hadoop reuses the key object across reduce() calls,
         // so copy its contents instead of keeping a reference
         tmpWord.set(key);
      }
    }

    @Override
    public void cleanup(Context context)
        throws IOException, InterruptedException {
      // write the word with the highest frequency
      context.write(tmpWord, new IntWritable(tmpFrequency));
    }
}
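
For this to work the job must run with exactly one reduce task; otherwise each reducer only emits its own local maximum. A minimal sketch of the driver change (a fragment against the question's main(), not a complete program):

```java
// In main(), before submitting the job: force a single reduce task
// so one reducer sees every word and can pick the global maximum.
job.setNumReduceTasks(1);

// Caution: this max-picking class must NOT also be the combiner.
// A combiner that emits only one pair per map task would lose counts.
// Either remove job.setCombinerClass(...) or keep the original summing
// reducer as the combiner, since summing partial counts stays correct.
```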
0 votes

You won't be able to do this in one step: the reduce phase runs independently for every key, so reducers cannot synchronize with each other. The solution is to run a second MapReduce job that aggregates the output of your original WordCount job under a single key and then just selects the max. GL!
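
In that second job, the mapper would emit every (word, count) pair under one constant key, so a single reducer receives all pairs and keeps the largest. The selection step itself boils down to the loop below; this is a plain-Java sketch (the class and method names are made up for illustration), assuming WordCount's default tab-separated `word<TAB>count` output lines:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

public class MaxFrequency {

    // Scan WordCount output lines ("word<TAB>count") and keep the
    // pair with the highest count -- the same comparison the single
    // reducer of the second job would run over its values.
    public static Map.Entry<String, Integer> selectMax(List<String> lines) {
        String maxWord = null;
        int maxCount = Integer.MIN_VALUE;
        for (String line : lines) {
            String[] parts = line.split("\t");
            int count = Integer.parseInt(parts[1]);
            if (count > maxCount) {
                maxCount = count;
                maxWord = parts[0];
            }
        }
        return new SimpleEntry<>(maxWord, maxCount);
    }
}
```

In the real job this comparison lives in the reducer, with the mapper emitting every line under one constant key (and the number of reduce tasks set to 1) so the comparison sees the whole dataset.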

0 votes

If you force the job to run with only one reduce task, you can search for the highest frequency across all keys with a loop inside the reducer.

At the end, the loop holds the key with the highest frequency, and you send that single pair to the final output (the context.write() call should be executed only once, at the end).