MapReduce shuffle phase bottleneck

Question

I am reading the original MapReduce paper. My understanding is that when working with, say hundreds of GBs of data, the network bandwidth for transferring so much data can be the bottleneck of a MapReduce job. For map tasks, we can reduce network bandwidth by scheduling map tasks on workers that already contain the data for any given split, since reading from local disk does not require network bandwidth.

However, the shuffle phase seems to be a huge bottleneck. A reduce task can potentially receive intermediate key/value pairs from all map tasks, and almost all of these intermediate key/value pairs will be streamed across the network.

When working with hundreds of GBs of data or more, is it necessary to use a combiner to have an efficient MapReduce job?

user3484461 user3484461 · Accepted Answer · 2016-07-05T17:29:25

Combiner plays important role if it can fit into that situation it acts like a local reducer so instead of sending all data it will send only few values or local aggregated value but combiner can't be applied in all the cases .

If a reduce function is both commutative and associative, then it can be used as a Combiner.

Like in case of Median it won't work .

Combiner can't be used in all the situation.

There are other parameters which can be tuned Like :

When map emits output it directly does not go disk it goes to 100 MB circular buffer which when fill 80% it spill the records into disk.

so you can increase the buffer size and increase thresh hold value in that case spillage would be less.

if there are so many spills then spills would merge to make a single file we can play with spill factor.

There are so threads which copies data from local disk to reducer jvm's so their number can be increased.

Compression can be used at intermediate level and final level.

So Combiner is not the only solution and won't be used in all the situation.

MapReduce shuffle phase bottleneck

2 Answers