2
votes

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

In the word count example, the same reduce function is used as both the combiner and the reducer.

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

I understood the way the reducer works, but in the case of the combiner, suppose my map output is

  <Java,1> <Virtual,1> <Machine,1> <Java,1>

Does it consider the first kv-pair and give the same output, since that key has only one value? How can it consider both Java keys and produce

  <Java,[1,1]>

if we are considering one kv pair at a time? I know this is a false assumption; someone please correct me on this.

3
I know the theory part, guys; I want the programmatic explanation here. How do the looping and combining take place? If one kv pair is fed into the combiner at a time, how does it find the similar keys? – KH_AJU
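Programmatically, the framework does not hand the combiner one kv pair at a time: the map output is buffered, sorted by key, and grouped, and only then is the combiner's reduce() invoked once per key with all of that key's values. A minimal, framework-free sketch of that grouping step (plain Java, illustrative names, not the Hadoop API):

```java
import java.util.*;

public class CombinerGrouping {
    // Simulates how the framework sorts and groups one mapper's output
    // by key before handing each (key, [values]) group to the combiner.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> mapOutput) {
        // TreeMap keeps keys sorted, mirroring the framework's sort phase.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // The combiner's reduce step: sum the grouped values for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("Java", 1), Map.entry("Virtual", 1),
            Map.entry("Machine", 1), Map.entry("Java", 1));
        Map<String, List<Integer>> grouped = group(mapOutput);
        System.out.println(grouped); // {Java=[1, 1], Machine=[1], Virtual=[1]}
        grouped.forEach((k, v) -> System.out.println(k + "=" + sum(v)));
    }
}
```

So by the time reduce() runs as a combiner, the two Java pairs have already been grouped into (Java, [1, 1]); the loop over `values` then sums them to (Java, 2).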

3 Answers

1
votes

The IntSumReducer class extends the Reducer class, and the Reducer class does the magic here. If we look into the documentation:

"Reduces a set of intermediate values which share a key to a smaller set of values. Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

Reducer has 3 primary phases:

Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

Reduce: In this phase the reduce method is called for each <key, (collection of values)> in the sorted inputs."

The program sets the same class for the combine and reduce operations:

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

So what I figured out is: if we are using only one data node, we don't necessarily have to set the combiner class for this word count program, since the reducer class itself takes care of the combiner's job.

job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);

The above configuration has the same effect on the word count program if you are using only one data node.
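That equivalence can be checked with a small simulation: counting the words directly versus pre-combining partial counts per split and then merging them gives the same final counts, because integer addition is associative and commutative (illustrative plain Java, not the Hadoop API):

```java
import java.util.*;

public class CombinerEquivalence {
    // Count words directly (reducer-only path).
    static Map<String, Integer> countDirect(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // Count words per split first (the local combine step), then merge
    // the partial counts, mimicking the combiner-then-reducer path.
    static Map<String, Integer> countWithCombine(List<List<String>> splits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> split : splits) {
            Map<String, Integer> partial = countDirect(split); // local combine
            partial.forEach((k, v) -> counts.merge(k, v, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Java", "Virtual", "Machine", "Java");
        Map<String, Integer> direct = countDirect(words);
        Map<String, Integer> combined = countWithCombine(
            List.of(words.subList(0, 2), words.subList(2, 4)));
        System.out.println(direct.equals(combined)); // true
    }
}
```

The combiner is therefore purely an optimization: with or without it the final counts are the same, it only shrinks the intermediate data.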

0
votes

The combiner combines the mapper output locally before it is sent to the reducer. A mapper on a host may output many kv pairs with the same key, and the combiner merges that map output first, thereby reducing the shuffle cost between mapper and reducer.

So if a mapper outputs (key, 1) (key, 1), the framework groups these into (key, [1, 1]) and the combiner then emits the combined pair (key, 2).
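In code, the merging of one mapper's output before it crosses the network can be sketched like this (plain Java, illustrative, not the Hadoop API):

```java
import java.util.*;

public class CombineStep {
    // Merge a mapper's (key, 1) pairs key-by-key and emit one combined
    // (key, sum) pair per key: fewer records are shuffled to the reducer.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput =
            List.of(Map.entry("key", 1), Map.entry("key", 1));
        System.out.println(combine(mapOutput)); // {key=2}: two records shrunk to one
    }
}
```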

-1
votes

The combiner runs on the map output. In your case the map output is

  <Java,1> <Virtual,1> <Machine,1> <Java,1>

The framework groups that output by key before the combiner runs, so the combiner is invoked once per key. Java is present two times, so its group is (Java, [1, 1]), i.e. (key, [grouped values]), which the combiner can then sum.