0
votes

I am reading the original MapReduce paper. My understanding is that when working with, say, hundreds of GBs of data, the network bandwidth for transferring that data can become the bottleneck of a MapReduce job. For map tasks, we can reduce network usage by scheduling them on workers that already hold the data for a given split, since reading from local disk does not require the network.

However, the shuffle phase seems to be a huge bottleneck. A reduce task can potentially receive intermediate key/value pairs from all map tasks, and almost all of these intermediate key/value pairs will be streamed across the network.

When working with hundreds of GBs of data or more, is it necessary to use a combiner to have an efficient MapReduce job?

2 Answers

1
votes

A combiner plays an important role when it fits the situation: it acts like a local reducer, so instead of sending all of the map output across the network, each mapper sends only a few locally aggregated values. However, a combiner can't be applied in every case.

If a reduce function is both commutative and associative, then it can be used as a Combiner.

For example, it won't work for computing a median: the median of per-mapper medians is not the median of the whole data set. So a combiner can't be used in every situation.
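
To make this concrete, here is a minimal sketch based on the classic Hadoop WordCount pattern (using the standard org.apache.hadoop.mapreduce API): summing counts is commutative and associative, so the very same Reducer class can double as the combiner.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Summing is commutative and associative, so this class can serve
    // both as the combiner (local aggregation on each mapper's output)
    // and as the final reducer.
    public class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

In the driver you would then register it twice: job.setCombinerClass(IntSumReducer.class) and job.setReducerClass(IntSumReducer.class).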

There are other parameters that can be tuned as well; a configuration sketch follows this list:

When a map task emits output, it does not go directly to disk; it first goes into a 100 MB circular memory buffer, which spills records to disk once it is 80% full.

You can increase the buffer size and raise the spill threshold, so that fewer spills occur.

If there are many spills, they are merged into a single output file; the number of spill files merged in one pass (the sort factor) can be tuned as well.

There are also threads that copy map output from local disk to the reducer JVMs; their number can be increased.

Compression can be used for both intermediate (map) output and the final job output.
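
As a rough illustration of where these knobs live, here is a driver sketch using the Hadoop 2.x property names; the values are placeholders to show the idea, not tuning recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Bigger sort buffer (default 100 MB) and higher spill
            // threshold (default 0.80) -> fewer spills to disk.
            conf.set("mapreduce.task.io.sort.mb", "256");
            conf.set("mapreduce.map.sort.spill.percent", "0.90");

            // How many spill files are merged in one pass.
            conf.set("mapreduce.task.io.sort.factor", "50");

            // Parallel copier threads pulling map output to the
            // reducer (default 5).
            conf.set("mapreduce.reduce.shuffle.parallelcopies", "10");

            // Compress intermediate (map) output and the final output.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);

            Job job = Job.getInstance(conf, "tuned-job");
            // ... set mapper/reducer/combiner, input/output paths, etc.
        }
    }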

So the combiner is not the only solution, and it can't be used in every situation.

1
votes

Let's say the mapper emits (word, count) pairs. Without a combiner, if a mapper sees the word abc 100 times, the reducer has to pull (abc, 1) 100 times. Say the size of one (word, count) pair is 7 bytes: without a combiner the reducer has to pull 7 * 100 = 700 bytes, whereas with a combiner it only needs to pull a single 7-byte pair, (abc, 100). This example just illustrates how the combiner can reduce network traffic. Note: this is a simplified example just to make the idea easier to understand.
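
Working out that arithmetic in a trivial, self-contained snippet (the 7-byte pair size is the assumption from the example above, not a real measurement):

    public class CombinerSavings {
        public static void main(String[] args) {
            int occurrences = 100;  // times the mapper sees the word "abc"
            int pairSize = 7;       // assumed bytes per (word, count) pair

            // Without a combiner: one (abc, 1) pair shuffled per occurrence.
            int withoutCombiner = occurrences * pairSize;  // 700 bytes

            // With a combiner: a single pre-aggregated (abc, 100) pair.
            int withCombiner = pairSize;                   // 7 bytes

            System.out.printf("shuffle bytes: %d without vs %d with a combiner%n",
                    withoutCombiner, withCombiner);
        }
    }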