Hadoop shuffle/merge time: average vs. total

Question

Hadoop outputs the following statistics:

average map time
average reduce time
average shuffle time
average merge time

The total map and reduce time can be obtained by multiplying the number of completed maps/reduces with these averages. But how can the total shuffle/merge time be obtained? Or: how is the average shuffle time calculated?

Ravindra babu Ravindra babu · Accepted Answer · 2015-11-15T05:30:28

Average Map Time = Total time taken by all Map tasks/ Count of Map Tasks

Average Reduce Time = Total time taken by all Reduce tasks/Count of Reduce tasks

Average Merge time = Average of (attempt.sortFinishTime - attempt.shuffleFinishTime)

In Shuffle phase, intermediate data, which was generated by Map tasks is directed to the right reducers. The Shuffle phase assigns keys to reducers & sends all values of a particular key to the right reducer.

Sorting also happens in this phase before sending output values to Reducer.

The shuffle phase involves transfer of data across the network from Map nodes.

From Apache link

Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Sort

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.

The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

Hadoop framework will execute these two phases : shuffling & sorting

Hadoop shuffle/merge time: average vs. total

1 Answers