Hadoop MapReduce Jobs for Highest Frequency

Question

I'm trying to use the basic word count as defined here. Is it possible that when the IntSumReducer does context.write, that context.write could be passed to a second reducer or output class that would reduce/change the final list given by the IntSumReducer down to a single largest frequency?

I am quite new to Hadoop/MapReduce and the concept of jobs in Java so I'm uncertain how exactly I would need to modify the default WordCount to comply to make that possible. Could I write a second Reducer function and place it inside of the same job? How would I do that? How would I signal that there is another reducer to be run after IntSumReducer?

Base WordCount:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

   public static class TokenizerMapper
   extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}
}

public static class IntSumReducer
   extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
     }
}

  public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}`

Siddharth Srinivasan Siddharth Srinivasan · Accepted Answer · 2017-03-29T07:33:10

What you're looking for is called a Combiner in hadoop, which does some semi-reduction before emitting the output to a final reducer class. For more info on it click here.

Hadoop MapReduce Jobs for Highest Frequency

1 Answers