0 votes

I am testing the scalability of a MapReduce-based algorithm with an increasing number of reducers. It looks fine in general (the time decreases as reducers are added), but the job time always drops sharply once the number of reducers reaches a certain value (30 in my Hadoop cluster), instead of decreasing gradually. What are the possible causes?

Something about my Hadoop job:

(1) Light map phase: only a few hundred lines of input, and each line generates around five thousand key-value pairs. The whole map phase takes no more than 2 minutes.

(2) Heavy reduce phase: each key in the reduce function matches 1,000-2,000 values, and the algorithm in the reduce phase is very compute-intensive. The reduce phase generally takes around 30 minutes to finish.
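
For concreteness, here is a minimal skeleton of a job with this shape (the "key-i" fan-out scheme is a made-up placeholder, not my actual keys):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class JobShape {
    // Light map phase: each input line fans out into ~5,000 key-value pairs.
    public static class FanOutMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (int i = 0; i < 5000; i++) {
                ctx.write(new Text("key-" + i), line); // placeholder key scheme
            }
        }
    }

    // Heavy reduce phase: 1,000-2,000 values per key, compute-intensive work.
    public static class HeavyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) {
                // expensive per-value computation goes here
            }
            ctx.write(key, new Text("done"));
        }
    }
}
```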

Time performance plot: [image not available]

Maybe the data starts fitting in memory. – AdamSkywalker
Good point @AdamSkywalker. It could also be that a heavy reduce task (e.g., two specific keys with heavy load) is then split into two. – vefthym
Thank you for all the comments. I did some more experiments, and I think in my case one of the major problems is what vefthym said: many keys went to the same reducers (not 100% sure yet, still verifying), so the slowest reducers slowed down the job. – batilei
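
To verify this, one option (a sketch, assuming Text keys and Hadoop's default HashPartitioner) is to replay the partitioning over a sample of the job's keys offline and count how many keys each reducer would receive:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class SkewCheck {
    public static void main(String[] args) {
        // Hypothetical key sample; in practice, dump the real map output keys.
        String[] keys = {"user42", "user17", "item9", "item9", "user42"};
        int numReducers = 30; // where the sharp drop was observed

        HashPartitioner<Text, Text> partitioner = new HashPartitioner<>();
        int[] counts = new int[numReducers];
        for (String k : keys) {
            // Same rule Hadoop applies: (hashCode & Integer.MAX_VALUE) % numReducers
            counts[partitioner.getPartition(new Text(k), null, numReducers)]++;
        }
        for (int i = 0; i < numReducers; i++) {
            System.out.printf("reducer %2d -> %d keys%n", i, counts[i]);
        }
    }
}
```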

2 Answers

0 votes

It is probably because of the high number of key-value pairs. At a certain number of reducers, the keys get distributed almost evenly across the reducers, so all reducers finish their tasks at roughly the same time. With fewer reducers, the job keeps waiting for one or two heavily loaded reducers to finish their work.
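
If the one or two heavy keys are known in advance, a custom Partitioner can give them dedicated reducers; a minimal sketch (the key names heavyKeyA/heavyKeyB are hypothetical):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HeavyKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions > 2) {
            if (k.equals("heavyKeyA")) return 0; // dedicated reducer
            if (k.equals("heavyKeyB")) return 1; // dedicated reducer
            // Hash everything else over the remaining reducers.
            return 2 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 2);
        }
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered on the job with job.setPartitionerClass(HeavyKeyPartitioner.class).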

0 votes

IMHO it could be that, with a sufficient number of reducers available, the network IO (to transfer intermediate results) to each reducer decreases. Since network IO is usually the bottleneck in most MapReduce programs, this decrease in required network IO gives a significant improvement.
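
If shuffle traffic really is the bottleneck, one standard mitigation (a sketch using Hadoop's stock configuration keys; the job name is a placeholder) is to compress the map output before it is transferred:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it is shuffled to reducers,
        // trading CPU for the network IO discussed above.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "shuffle-compression-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```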