1
votes

For example, in a word count job, I have 2 mappers: Mapper A and Mapper B.

The output of mapper A is: {hi,1},{hello,1},{hey,1}

The output of mapper B is: {hi,1},{bye,1},{hey,1}

Suppose there is no combiner and 1 reducer.

Then, first, shuffling happens.

During shuffling, the outputs of both mappers are merged, and the result is:

{hi,[1,1]},{hello,1},{hey,[1,1]},{bye,1}

Then sorting happens:

{bye,1},{hello,1},{hey,[1,1]},{hi,[1,1]}

Then the reduce function in the reducer task is called, which produces the output:

bye,1
hello,1
hey,2
hi,2

Is the above process correct? And does shuffling happen before the reduce function is called? Or does the scheduler just accumulate the data from the different mappers without grouping the values that share a key, with that grouping happening only after sorting? Why is sorting useful here?

2 Answers

0
votes

The short answer is: yes, shuffling happens before reduce() is called. Sorting is needed so that all values for the same key end up next to each other, which lets the reducer group values by key.
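
To make the role of sorting concrete, here is a minimal plain-Java sketch of the sequence for a single reducer. It does not use the Hadoop API; the class name ShuffleSortSketch and the method run are made up for illustration, and the input is the mapper output from the question. It only shows that once the pairs are sorted by key, grouping needs just one linear pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation, not Hadoop code: shuffle, sort, then reduce.
class ShuffleSortSketch {

    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        // Shuffle: the single reducer collects the output of every mapper.
        List<Map.Entry<String, Integer>> fetched = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            fetched.addAll(output);
        }

        // Sort: order by key so that equal keys become adjacent.
        fetched.sort(Map.Entry.comparingByKey());

        // Group and reduce in one pass: a change of key ends the current group.
        Map<String, Integer> result = new LinkedHashMap<>();
        int i = 0;
        while (i < fetched.size()) {
            String key = fetched.get(i).getKey();
            int sum = 0;
            while (i < fetched.size() && fetched.get(i).getKey().equals(key)) {
                sum += fetched.get(i).getValue(); // stands in for reduce(key, values)
                i++;
            }
            result.put(key, sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperA =
                List.of(Map.entry("hi", 1), Map.entry("hello", 1), Map.entry("hey", 1));
        List<Map.Entry<String, Integer>> mapperB =
                List.of(Map.entry("hi", 1), Map.entry("bye", 1), Map.entry("hey", 1));
        System.out.println(run(List.of(mapperA, mapperB)));
        // prints {bye=1, hello=1, hey=2, hi=2}
    }
}
```

Without the sort, the reducer would have to keep every key in memory (for example in a hash map) to know when it has seen all values for a key. With sorted input, it can stream through the data and emit each key as soon as the next key starts.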

For more details, see the answer to: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

0
votes

Yes, shuffling (and sorting) are performed before the reduce method is called.

Note: if you specify zero reducers (setNumReduceTasks(0)), shuffling and sorting are not performed at all, and the mapper output is written directly as the job output.
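
As a rough illustration of the zero-reducer case, here is a small plain-Java sketch (again not Hadoop code; MapOnlySketch and run are hypothetical names). It assumes the mapper output from the question and shows that each mapper's records are written out as they were emitted, with no merge across mappers, no sort, and no grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a map-only job (zero reducers).
class MapOnlySketch {

    // Returns one "output file" (list of lines) per mapper, in emission order.
    static List<List<String>> run(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        List<List<String>> files = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            List<String> lines = new ArrayList<>();
            for (Map.Entry<String, Integer> pair : output) {
                // No shuffle, no sort: the pair is written exactly as emitted.
                lines.add(pair.getKey() + "\t" + pair.getValue());
            }
            files.add(lines);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperA =
                List.of(Map.entry("hi", 1), Map.entry("hello", 1), Map.entry("hey", 1));
        List<Map.Entry<String, Integer>> mapperB =
                List.of(Map.entry("hi", 1), Map.entry("bye", 1), Map.entry("hey", 1));
        List<List<String>> files = run(List.of(mapperA, mapperB));
        for (int i = 0; i < files.size(); i++) {
            System.out.println("mapper " + i + " output:");
            for (String line : files.get(i)) {
                System.out.println(line);
            }
        }
    }
}
```

Note that "hi" and "hey" appear once in each output with a count of 1. Nothing sums them, because that aggregation is exactly what the shuffle, sort, and reduce steps would have provided.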