0 votes

I am testing the scalability of a MapReduce-based algorithm with an increasing number of reducers. It looks fine in general (the time decreases as reducers are added), but the job time always drops sharply once the number of reducers reaches a certain value (30 in my Hadoop cluster), instead of decreasing gradually. What are the possible causes?

Something about my Hadoop job:

(1) Light map phase: only a few hundred lines of input, and each line generates around five thousand key-value pairs. The whole map phase takes no more than 2 minutes.

(2) Heavy reduce phase: each key in the reduce function matches 1,000-2,000 values, and the algorithm in the reduce phase is very compute-intensive. The reduce phase generally takes around 30 minutes to finish.
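
For concreteness, here is a minimal skeleton of a job with this shape (the "key-i" fan-out scheme is a made-up placeholder, not my actual keys):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class JobShape {
    // Light map phase: each input line fans out into ~5,000 key-value pairs.
    public static class FanOutMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (int i = 0; i < 5000; i++) {
                ctx.write(new Text("key-" + i), line); // placeholder key scheme
            }
        }
    }

    // Heavy reduce phase: 1,000-2,000 values per key, compute-intensive work.
    public static class HeavyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) {
                // expensive per-value computation goes here
            }
            ctx.write(key, new Text("done"));
        }
    }
}
```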

Time performance plot: [image not available]

Maybe the data starts fitting in memory. – AdamSkywalker
Good point @AdamSkywalker. It could also be that a heavy reduce task (e.g., two specific keys with heavy load) is then split into two. – vefthym
Thank you for all the comments. I did some more experiments, and I think in my case one of the major problems is what vefthym said: many keys went to the same reducers (not 100% sure yet, still verifying), so the slowest reducers slowed down the job. – batilei
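
To verify this, one option (a sketch, assuming Text keys and Hadoop's default HashPartitioner) is to replay the partitioning over a sample of the job's keys offline and count how many keys each reducer would receive:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class SkewCheck {
    public static void main(String[] args) {
        // Hypothetical key sample; in practice, dump the real map output keys.
        String[] keys = {"user42", "user17", "item9", "item9", "user42"};
        int numReducers = 30; // where the sharp drop was observed

        HashPartitioner<Text, Text> partitioner = new HashPartitioner<>();
        int[] counts = new int[numReducers];
        for (String k : keys) {
            // Same rule Hadoop applies: (hashCode & Integer.MAX_VALUE) % numReducers
            counts[partitioner.getPartition(new Text(k), null, numReducers)]++;
        }
        for (int i = 0; i < numReducers; i++) {
            System.out.printf("reducer %2d -> %d keys%n", i, counts[i]);
        }
    }
}
```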

2 Answers

0 votes

It is probably because of the high number of key-value pairs. At a certain number of reducers, the keys get distributed almost evenly across the reducers, so all reducers finish their tasks at roughly the same time. With fewer reducers, the job keeps waiting for one or two heavily loaded reducers to finish their work.
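
If the one or two heavy keys are known in advance, a custom Partitioner can give them dedicated reducers; a minimal sketch (the key names heavyKeyA/heavyKeyB are hypothetical):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HeavyKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions > 2) {
            if (k.equals("heavyKeyA")) return 0; // dedicated reducer
            if (k.equals("heavyKeyB")) return 1; // dedicated reducer
            // Hash everything else over the remaining reducers.
            return 2 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 2);
        }
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered on the job with job.setPartitionerClass(HeavyKeyPartitioner.class).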

0 votes

IMHO it could be that, with a sufficient number of reducers available, the network IO (to transfer intermediate results) to each reducer decreases. Since network IO is usually the bottleneck in most MapReduce programs, this decrease in required network IO gives a significant improvement.
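
If shuffle traffic really is the bottleneck, one standard mitigation (a sketch using Hadoop's stock configuration keys; the job name is a placeholder) is to compress the map output before it is transferred:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it is shuffled to reducers,
        // trading CPU for the network IO discussed above.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "shuffle-compression-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```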