For example, in a word count job, I have two mappers: Mapper A and Mapper B.
The output of mapper A is: {hi,1},{hello,1},{hey,1}
The output of mapper B is: {hi,1},{bye,1},{hey,1}
Suppose there is no combiner and only 1 reducer.
Then, first, shuffling happens.
In shuffling, the outputs of both mappers are merged, and the result is:
{hi,[1,1]},{hello,1},{hey,[1,1]},{bye,1}
Then sorting happens:
{bye,1},{hello,1},{hey,[1,1]},{hi,[1,1]}
Then the reduce function in the reducer task is called, which produces the output:
bye,1
hello,1
hey,2
hi,2
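
To make the flow I have in mind concrete, here is a minimal sketch in plain Python. It is not Hadoop code and may not match what Hadoop actually does internally; the function names (shuffle, sort_by_key, reduce_counts, run_job) are made up for illustration, and the mapper outputs are copied from the example above.

```python
from collections import defaultdict

# Mapper outputs from the example (hard-coded for illustration).
mapper_a = [("hi", 1), ("hello", 1), ("hey", 1)]
mapper_b = [("hi", 1), ("bye", 1), ("hey", 1)]


def shuffle(*mapper_outputs):
    """Merge all mapper outputs and collect the values for each key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(grouped)


def sort_by_key(grouped):
    """Order the grouped records by key (keys are unique at this point)."""
    return sorted(grouped.items())


def reduce_counts(key, values):
    """Word count reduce function: sum the values for one key."""
    return key, sum(values)


def run_job(*mapper_outputs):
    """Shuffle (group), then sort, then call reduce once per key."""
    grouped = shuffle(*mapper_outputs)
    ordered = sort_by_key(grouped)
    return [reduce_counts(key, values) for key, values in ordered]


if __name__ == "__main__":
    for word, count in run_job(mapper_a, mapper_b):
        print(f"{word},{count}")
```

In this sketch the grouping is done during the shuffle step and sorting only orders the already grouped keys, which is exactly the part I am unsure about.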
Is the above process correct? Does shuffling happen before the reduce function is called? Or does the framework just accumulate the data from the different mappers without grouping records that share a key, with the grouping happening only after sorting? And why is sorting useful here?
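
The alternative I am asking about would look roughly like the following in plain Python. This is only a sketch of my mental model, not Hadoop code, and the function names (accumulate, group_after_sort, word_count) are hypothetical: the records from the mappers are only concatenated, then sorted by key, and the grouping is done afterwards on the sorted records.

```python
from itertools import groupby
from operator import itemgetter

# Mapper outputs from the example (hard-coded for illustration).
mapper_a = [("hi", 1), ("hello", 1), ("hey", 1)]
mapper_b = [("hi", 1), ("bye", 1), ("hey", 1)]


def accumulate(*mapper_outputs):
    """Concatenate the mapper outputs without grouping anything."""
    return [pair for output in mapper_outputs for pair in output]


def group_after_sort(pairs):
    """Sort by key, then group adjacent records that share a key.

    groupby only merges neighbouring items, so it relies on the sort
    having placed equal keys next to each other.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def word_count(*mapper_outputs):
    """Accumulate, sort, group, then sum the values for each key."""
    grouped = group_after_sort(accumulate(*mapper_outputs))
    return [(key, sum(values)) for key, values in grouped]


if __name__ == "__main__":
    for word, count in word_count(mapper_a, mapper_b):
        print(f"{word},{count}")
```

If this second model is closer to reality, then sorting would be what makes the grouping possible in a single pass, but I do not know whether that is how it actually works.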