Some time has passed and no one has attempted an answer, so I will put forth some ideas in the hope that others will point out any flaws.
The most important thing when configuring Hadoop is not to let the jobs consume more resources than the box has; otherwise jobs fail, and the exceptions are not always helpful in quickly determining what went wrong. Memory in particular causes an immediate crash when exhausted, and as the question points out, the JVM may try to request an unnecessarily large amount of it.
We must also account for processes other than the map and reduce tasks themselves (such as the sorting that occurs between the map and reduce phases). Unfortunately, no one has come forward with an estimate of how many such processes may exist at the same time.
So here is my proposal. Let the number of mappers be M, the number of reducers be R, and the total virtual RAM on the box be G. I am currently allocating G/(2*M + R) of RAM to each process. The factor of 2 assumes there is one extra process per map task, sorting its output or doing other supporting work. Finally, I make sure that 2*M + R < P, where P is the number of processors on the box (counting hyper-threaded cores where available), to prevent excessive context switching.
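For concreteness, here is a minimal sketch of that arithmetic in Python. The cluster numbers (32 GB of RAM, 8 mappers, 4 reducers, 24 logical processors) are made up for illustration; the last line only prints a suggested -Xmx value that you could plug into mapred.child.java.opts (or the equivalent setting for your Hadoop version), it does not apply any configuration itself.

```python
def per_process_ram_mb(total_ram_mb, mappers, reducers):
    """RAM to give each process: G / (2*M + R)."""
    return total_ram_mb // (2 * mappers + reducers)

def within_cpu_budget(mappers, reducers, processors):
    """Check the 2*M + R < P constraint to limit context switching."""
    return 2 * mappers + reducers < processors

G_MB = 32 * 1024    # total RAM on the box, in MB (illustrative)
M, R, P = 8, 4, 24  # mappers, reducers, logical processors (hyper-threading counted)

heap_mb = per_process_ram_mb(G_MB, M, R)
print(f"Per-process RAM: {heap_mb} MB")                 # 32768 // 20 = 1638 MB
print(f"CPU budget OK:   {within_cpu_budget(M, R, P)}") # 2*8 + 4 = 20 < 24 -> True
print(f"Suggested child JVM opts: -Xmx{heap_mb}m")
```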
So far I haven't taken down my box with this approach.