You are setting the wrong configuration parameter for what you are trying to do: you want mapred.tasktracker.map.tasks.maximum instead. What you are setting, mapred.map.tasks, is the number of map tasks for the job, which in most cases you should never modify. Hadoop sets it to the number of input blocks by default, so just leave it alone.
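To see why you should leave mapred.map.tasks alone, here is a rough sketch of how the default is derived: by default Hadoop creates roughly one map task per HDFS block of the input (the real split logic also honors min/max split sizes, so treat this as an approximation, and the 64 MiB block size is just the classic default):

```python
import math

def default_map_tasks(file_sizes_bytes, block_size_bytes=64 * 1024 * 1024):
    """Approximate Hadoop's default map-task count: about one task per
    HDFS block of each input file. Illustrative only -- the actual
    InputFormat split computation has more knobs."""
    return sum(max(1, math.ceil(size / block_size_bytes))
               for size in file_sizes_bytes)

# A single 1 GiB input file with 64 MiB blocks yields 16 map tasks.
print(default_map_tasks([1024 * 1024 * 1024]))  # 16
```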
Add this to mapred-site.xml:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>24</value>
</property>
After changing this, you need to restart your tasktrackers.
To verify the change took effect, look at the JobTracker web interface. Near the top it shows how many map slots the cluster has. Confirm that it now reads 96, not 16.
The way resource allocation works is that your MapReduce cluster has a fixed number of map slots and reduce slots. A running job consumes map slots; if it has more map tasks than there are map slots (which is typical), the extra map tasks queue up behind the running ones and execute in later waves.
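The queueing behavior above can be sketched with a quick calculation (the node and slot counts here are illustrative, matching the figures discussed in this answer):

```python
import math

def map_waves(num_map_tasks, nodes, map_slots_per_node):
    """Number of 'waves' a job needs: map tasks beyond the cluster's
    total slot count wait in the queue and run in later waves."""
    total_slots = nodes * map_slots_per_node
    return math.ceil(num_map_tasks / total_slots)

# 96 map tasks on 4 nodes: with 4 slots per node the job runs in
# 6 waves; with 24 slots per node everything runs in a single wave.
print(map_waves(96, 4, 4))   # 6
print(map_waves(96, 4, 24))  # 1
```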
That's what you are seeing when each node runs 4 tasks at a time: it will eventually work through all of them. But you are right that with 24 cores (that's 2 hyperthreaded CPUs, I assume?) and 7 disks you want more slots. I've heard rules of thumb of 1 per disk, 1 per physical core, or 1 per hardware thread (with hyperthreading), but there is no real science behind them and it is totally workload dependent. If you really want the most performance, just try different values. I suggest trying between 10 and 24 map slots per node.
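Those rules of thumb can be written down as a small helper to generate starting points to benchmark (this is just a sketch of the heuristics mentioned above, not any official formula):

```python
def slot_candidates(cores, disks, hyperthreaded=True):
    """Candidate values for mapred.tasktracker.map.tasks.maximum based
    on the common rules of thumb. These are starting points only --
    the right value is workload dependent, so benchmark a range."""
    physical_cores = cores // 2 if hyperthreaded else cores
    return {
        "one_per_disk": disks,
        "one_per_physical_core": physical_cores,
        "one_per_hw_thread": cores,
    }

# For the machine in question: 24 hardware threads, 7 disks.
print(slot_candidates(cores=24, disks=7))
```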