2
votes

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

In the word count example, the same reduce function is used as both the combiner and the reducer.

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

I understood the way the reducer works, but in the case of the combiner, suppose my map output is

  <Java,1> <Virtual,1> <Machine,1> <Java,1>

Does it consider the first kv-pair and give the same output, since that key has only one value? How can it consider both Java keys and produce

  <Java,[1,1]>

if we are considering one kv pair at a time? I know this is a false assumption; someone please correct me on this.

3
I know the theory part, guys; I want the programmatic explanation here. How do the looping and combining take place? If one kv pair is fed into the combiner at a time, how does it find the similar keys? – KH_AJU
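Programmatically, the framework does not hand the combiner one kv pair at a time: the map output is buffered, sorted by key, and grouped, and only then is the combiner's reduce() invoked once per key with all of that key's values. A minimal, framework-free sketch of that grouping step (plain Java, illustrative names, not the Hadoop API):

```java
import java.util.*;

public class CombinerGrouping {
    // Simulates how the framework sorts and groups one mapper's output
    // by key before handing each (key, [values]) group to the combiner.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> mapOutput) {
        // TreeMap keeps keys sorted, mirroring the framework's sort phase.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // The combiner's reduce step: sum the grouped values for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("Java", 1), Map.entry("Virtual", 1),
            Map.entry("Machine", 1), Map.entry("Java", 1));
        Map<String, List<Integer>> grouped = group(mapOutput);
        System.out.println(grouped); // {Java=[1, 1], Machine=[1], Virtual=[1]}
        grouped.forEach((k, v) -> System.out.println(k + "=" + sum(v)));
    }
}
```

So by the time reduce() runs as a combiner, the two Java pairs have already been grouped into (Java, [1, 1]); the loop over `values` then sums them to (Java, 2).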

3 Answers

1
votes

The IntSumReducer class extends the Reducer class, and the Reducer class does the magic here. If we look into the documentation:

"Reduces a set of intermediate values which share a key to a smaller set of values. Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

Reducer has 3 primary phases:

Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

Reduce: In this phase the reduce method is called for each <key, (collection of values)> in the sorted inputs."

The program sets the same class for the combine and reduce operations:

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

So what I figured out is: if we are using only one data node, we don't necessarily have to set the combiner class for this word count program, since the reducer class itself takes care of the combiner's job.

job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);

The above configuration has the same effect on the word count program if you are using only one data node.
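That equivalence can be checked with a small simulation: counting the words directly versus pre-combining partial counts per split and then merging them gives the same final counts, because integer addition is associative and commutative (illustrative plain Java, not the Hadoop API):

```java
import java.util.*;

public class CombinerEquivalence {
    // Count words directly (reducer-only path).
    static Map<String, Integer> countDirect(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // Count words per split first (the local combine step), then merge
    // the partial counts, mimicking the combiner-then-reducer path.
    static Map<String, Integer> countWithCombine(List<List<String>> splits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> split : splits) {
            Map<String, Integer> partial = countDirect(split); // local combine
            partial.forEach((k, v) -> counts.merge(k, v, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Java", "Virtual", "Machine", "Java");
        Map<String, Integer> direct = countDirect(words);
        Map<String, Integer> combined = countWithCombine(
            List.of(words.subList(0, 2), words.subList(2, 4)));
        System.out.println(direct.equals(combined)); // true
    }
}
```

The combiner is therefore purely an optimization: with or without it the final counts are the same, it only shrinks the intermediate data.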

0
votes

The combiner combines the mapper output locally before it is sent to the reducer. A mapper on a host may output many kv pairs with the same key, and the combiner merges that map output first, thereby reducing the shuffle cost between mapper and reducer.

So if a mapper outputs (key, 1) (key, 1), the framework groups these into (key, [1, 1]) and the combiner then emits the combined pair (key, 2).
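In code, the merging of one mapper's output before it crosses the network can be sketched like this (plain Java, illustrative, not the Hadoop API):

```java
import java.util.*;

public class CombineStep {
    // Merge a mapper's (key, 1) pairs key-by-key and emit one combined
    // (key, sum) pair per key: fewer records are shuffled to the reducer.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput =
            List.of(Map.entry("key", 1), Map.entry("key", 1));
        System.out.println(combine(mapOutput)); // {key=2}: two records shrunk to one
    }
}
```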

-1
votes

The combiner runs on the map output. In your case the map output is

  <Java,1> <Virtual,1> <Machine,1> <Java,1>

The framework groups that output by key before the combiner runs, so the combiner is invoked once per key. Java is present two times, so its group is (Java, [1, 1]), i.e. (key, [grouped values]), which the combiner can then sum.