Consider the WordCount problem as a MapReduce program.
Suppose the Mapper output is as follows: Hello 1, World 1, Hello 1, Hadoop 1, Hello 1, Hadoop 1
This output goes to the partitioner (we specify 2 as the number of reducers).
The map output is now partitioned into 2 parts:
Part 1:
Hello 1
Hello 1
Hello 1
Part 2:
World 1
Hadoop 1
Hadoop 1
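To show what I mean by partitioning, here is a minimal plain-Java sketch. It uses the formula that, as far as I know, Hadoop's default HashPartitioner applies: (hashCode & Integer.MAX_VALUE) % numReducers. The class name PartitionSketch and the use of String keys are my own assumptions for illustration. Real jobs use Text keys, whose hashCode differs from String's, so the actual partition numbers may not match the split shown above.

```java
class PartitionSketch {
    // Clear the sign bit so the result is non-negative, then take the modulus
    // by the number of reducers. Equal keys always map to the same partition.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Keys of the sample Mapper output, in emission order.
        String[] mapOutputKeys = {"Hello", "World", "Hello", "Hadoop", "Hello", "Hadoop"};
        for (String key : mapOutputKeys) {
            System.out.println(key + " 1 -> partition " + partitionFor(key, 2));
        }
    }
}
```

The point is only that every occurrence of a key lands in the same partition, and therefore reaches the same reducer.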
At the reducer, the input then arrives as:
Hello [1,1,1]
World [1]
Hadoop [1,1]
Please help me understand when this merging of values happens. For a MapReduce job: K1, V1 -> (Mapper output) K2, V2 -> (Sort and Shuffle) K3, [V3] -> (Reducer output) K4, V4
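To make the K2, V2 -> K3, [V3] step concrete, here is a rough plain-Java sketch of the logical result I am asking about. GroupingSketch and groupByKey are hypothetical names of my own, not Hadoop classes. It only models the outcome: my understanding is that the real framework streams values from sorted data instead of building lists in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupingSketch {
    // Collects the values of equal keys into one list.
    // TreeMap keeps the keys in sorted order, like the sort phase would.
    static Map<String, List<Integer>> groupByKey(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The sample Mapper output as parallel key and value arrays.
        String[] keys = {"Hello", "World", "Hello", "Hadoop", "Hello", "Hadoop"};
        int[] values = {1, 1, 1, 1, 1, 1};
        groupByKey(keys, values).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```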
My question is when the values get merged into a list: before the Combiner runs, after the Combiner runs (during sort and shuffle), or on the reducer side, just before the input is handed to the Reducer?
As I understand it, the Mapper output first goes to an in-memory buffer whose size is set by mapreduce.task.io.sort.mb. When the buffer fills past its spill threshold, its contents are spilled to local disk. Before spilling, the data is sorted by partition, and within each partition by key; after the sort, the Combiner is called to reduce the size of the data. After the Mapper completes, the spill files are merged, and the Combiner is called again depending on the min.num.spills.for.combine value.
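The spill step as I picture it can be sketched in plain Java as follows. This is only a model of my understanding, not Hadoop's actual implementation: SpillSketch, Rec, and sortAndCombine are hypothetical names, and the partition numbers (Hello in 0, World and Hadoop in 1) are hard-coded from my example above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SpillSketch {
    // One buffered map output record, tagged with its target partition.
    static final class Rec {
        final int partition;
        final String key;
        final int value;

        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "p" + partition + " " + key + " " + value;
        }
    }

    // Models one spill: sort by (partition, key), then sum each run of
    // equal keys, the way a word count combiner would.
    static List<Rec> sortAndCombine(List<Rec> buffer) {
        List<Rec> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key));
        List<Rec> spill = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            Rec first = sorted.get(i);
            int sum = 0;
            // Consume the whole run of records sharing this partition and key.
            while (i < sorted.size()
                    && sorted.get(i).partition == first.partition
                    && sorted.get(i).key.equals(first.key)) {
                sum += sorted.get(i).value;
                i++;
            }
            spill.add(new Rec(first.partition, first.key, sum));
        }
        return spill;
    }

    public static void main(String[] args) {
        List<Rec> buffer = new ArrayList<>();
        buffer.add(new Rec(0, "Hello", 1));
        buffer.add(new Rec(1, "World", 1));
        buffer.add(new Rec(0, "Hello", 1));
        buffer.add(new Rec(1, "Hadoop", 1));
        buffer.add(new Rec(0, "Hello", 1));
        buffer.add(new Rec(1, "Hadoop", 1));
        sortAndCombine(buffer).forEach(System.out::println);
    }
}
```

In this model the values of a key are brought together by the sort, so the combiner sees them as one run of adjacent records.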
In the word count problem, the Reducer accumulates all the values in the iterable for each key and writes the key together with the sum of its values.
Since the Combiner is a mini reducer, and we specify the same Reducer class as the Combiner:
job.setCombinerClass(Reduce.class);
is it worthwhile to call the Combiner before the merge, during sort and shuffle, or is my understanding incorrect?
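My reason for thinking the same class can serve as both Combiner and Reducer is that summing is associative and commutative, so summing partial sums gives the same total as summing the raw values. A small plain-Java check of that idea follows. CombinerReuseSketch and the two spills are hypothetical and only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class CombinerReuseSketch {
    // Word count reduce logic: sum all values seen for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Values for "Hello" split across two hypothetical spills.
        int partialA = sum(Arrays.asList(1, 1)); // combiner on spill A
        int partialB = sum(Arrays.asList(1));    // combiner on spill B
        // The reducer applies the same logic to the partial sums.
        int viaCombiner = sum(Arrays.asList(partialA, partialB));
        // Without a combiner, the reducer sums the raw values directly.
        int direct = sum(Arrays.asList(1, 1, 1));
        System.out.println(viaCombiner + " " + direct);
    }
}
```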
Please clarify.