2
votes

As I understand it, between mapping and reducing there is Combining (if applicable) followed by partitioning followed by shuffling.

While it seems clear that partitioning and shuffle&sort are distinct phases in map/reduce, I cannot differentiate their roles.

Together they must take the key/value pairs from many mappers (or combiners) and send them to reducers, with all values sharing the same key being sent to the same reducer. But I don't know what each of the two phases does.

1

1 Answers

1
votes

Partitioning is the sub-phase executed just before shuffle-sort sub-phase. But why partitioning is needed?

Each reducer takes data from several different mappers. Look at this picture (found it here):

enter image description here

Hadoop must know that all Ayush records from every mapper must be sent to the particular reducer (or the task will return incorrect result). The process when it decides which key will be sent to which partition, which will be sent to the particular reducer is the partitioning process. The total number of partitions is equal to the total number of reducers.

Shuffling is the process of moving the intermediate data provided by the partitioner to the reducer node. During this phase, there are sorting and merging subphases:

Merging - combines all key-value pairs which have same keys and returns >.

Sorting - takes output from Merging step and sort all key-value pairs by using Keys. This step also returns (Key, List[Value]) output but with sorted key-value pairs.

Output of shuffle-sort phase is sent directly to reducers.