comparison between the time of a map-reduce job with and without a reducer

Question

In my Hadoop job, when I set the number of reducers to 0, the map phase is dramatically faster than when the number of reducers is not 0. At the beginning of the map phase no reducer is running yet, so I don't understand why the mapping time increases so dramatically.

Thomas Jungblut · Accepted Answer · 2013-10-28T19:46:57

If you have not configured a reducer, the map output is not sorted before it is written to disk.

The reason is that Hadoop uses an external sort algorithm: each map task sorts its own output [1], and the reducer then only has to merge the sorted map output segments together. When there is no reducer, there is no need to group the data by key, and therefore no need to sort.

[1] Addition for possible nit-pickers: A map task starts to sort once its output buffer is filled up. This sorted segment is spilled to disk and merged at the end of the map task with all other spilled segments until a single sorted file emerges. Sending a single file (maybe even compressed) is much more efficient for bandwidth usage / transfer performance. On the reducer side, the sorted files will then be merged again. The very last merge pass is directly streamed into the reduce method.
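To make the spill-and-merge idea concrete, here is a minimal in-memory sketch in plain Java. It is only a toy model and not Hadoop's actual implementation: the class `SpillMergeSketch`, its methods, and the `bufferSize` parameter are hypothetical names made up for illustration, real spills go to disk, and partitioning by reducer is ignored. It shows the extra work a map task does when reducers are configured (sort every spill, then merge) compared to the pass-through in the zero-reducer case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy in-memory model of map-side spill-and-merge; not Hadoop's actual code.
final class SpillMergeSketch {

    // Cut the map output into buffer-sized chunks and sort each chunk,
    // mimicking one sorted spill segment per filled buffer.
    // bufferSize is a hypothetical knob (number of records per spill).
    static List<List<String>> spill(List<String> mapOutput, int bufferSize) {
        List<List<String>> segments = new ArrayList<>();
        for (int start = 0; start < mapOutput.size(); start += bufferSize) {
            int end = Math.min(start + bufferSize, mapOutput.size());
            List<String> segment = new ArrayList<>(mapOutput.subList(start, end));
            Collections.sort(segment);
            segments.add(segment);
        }
        return segments;
    }

    // K-way merge of already sorted segments using a min-heap of cursors.
    static List<String> merge(List<List<String>> segments) {
        // Each heap entry is {segmentIndex, positionInSegment}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> segments.get(a[0]).get(a[1])
                .compareTo(segments.get(b[0]).get(b[1])));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<String> segment = segments.get(cursor[0]);
            merged.add(segment.get(cursor[1]));
            // Advance the cursor within the same segment, if records remain.
            if (cursor[1] + 1 < segment.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return merged;
    }

    // With reducers: sort each spill, then merge into one sorted run.
    // With zero reducers: the records pass through in their original order.
    static List<String> mapTaskOutput(List<String> mapOutput, int bufferSize, int numReducers) {
        if (numReducers == 0) {
            return new ArrayList<>(mapOutput);
        }
        return merge(spill(mapOutput, bufferSize));
    }
}
```

Calling `SpillMergeSketch.mapTaskOutput(keys, 2, 1)` sorts and merges the keys, while passing `0` as the last argument skips both steps entirely, which is the work a map-only job saves.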
