Optimal configuration for Hadoop?

Question

I have 4 nodes, each with 24 CPUs and 7 disks. On each node I copied a 500GB file from the local filesystem, so I now have 4 files. Each file's blocks are on a single node, spread across all of its disks.

What is the optimal configuration for Hadoop's MapReduce for this setup (I'm using it only for these files)? I've tried setting mapred.map.tasks to 96, but Hadoop creates just 4 tasks (one per node).

This question seems to be already answered on the hadoop mailing list. Please provide the final answer and accept it. — Pradeep Gollakota

Donald Miner · Accepted Answer · 2013-12-12T01:12:36

You are setting the wrong configuration parameter for what you are trying to do. You want mapred.tasktracker.map.tasks.maximum instead. What you are setting is the number of map tasks for a job... which in most cases you should never modify. Hadoop will set mapred.map.tasks to the number of blocks by default, so just leave it alone.

Add this to mapred-site.xml:

<property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>24</value>
</property>

After changing this, you need to restart your tasktrackers. To verify the change took effect, take a look at the JobTracker web interface. You should see something near the top that tells you how many map slots you have open. Check that it shows 96 (4 nodes × 24 slots), not 16.

The way the resource allocation works is your MapReduce cluster has a number of map slots and reduce slots. A job will consume map slots when the job runs. If the job has more map tasks than map slots (pretty typical), then your map tasks will be queued up behind the first running map tasks and run later.
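To make the queuing concrete, here is a rough back-of-the-envelope sketch of how many "waves" of map tasks a job needs for a given slot count. It assumes a 64 MB HDFS block size and one map task per block; both are assumptions, so substitute your cluster's actual values. The function name `map_waves` is made up for illustration.

```python
import math

def map_waves(num_map_tasks, nodes, slots_per_node):
    """Return (total map slots, number of waves needed to run all map tasks)."""
    total_slots = nodes * slots_per_node
    return total_slots, math.ceil(num_map_tasks / total_slots)

# Assumptions: 64 MB blocks, one map task per block.
BLOCK_MB = 64
FILE_MB = 500 * 1024          # one 500GB file
NODES = 4                     # one file per node

num_tasks = NODES * math.ceil(FILE_MB / BLOCK_MB)

for slots_per_node in (4, 10, 24):
    total, waves = map_waves(num_tasks, NODES, slots_per_node)
    print(f"{slots_per_node:>2} slots/node -> {total:>2} slots total, "
          f"{waves} waves for {num_tasks} map tasks")
```

Fewer waves does not automatically mean a faster job, since tasks running at the same time compete for disk and CPU, but it shows why a low slot count leaves most tasks waiting in the queue.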

That's what you are seeing when each node gets only 4 tasks at a time. It'll eventually run through all of them. But you are right that with 24 cores (that's 2 CPUs hyperthreaded, I assume?) and 7 disks you want more slots. I've heard rules of thumb of 1 per disk, 1 per core, 1 per core (with hyperthreading), but there is no real science behind them and it is totally workload dependent. If you really want to get the most performance, just try different values. I suggest values between 10 and 24 map slots per node.
