
I've just set up a Hadoop cluster with Hadoop 0.20.205. I have a master (NameNode and JobTracker) and two other boxes (slaves).

I'm trying to understand how to define the number of map and reduce tasks to use.

So far I understand that I can set the maximum number of map and reduce tasks that each TaskTracker can handle simultaneously with *mapred.tasktracker.map.tasks.maximum* and *mapred.tasktracker.reduce.tasks.maximum*.

Also, I can define the maximum number of map tasks the whole cluster can run simultaneously with *mapred.map.tasks*. Is that right?

If so, how can I know what should be the value for *mapred.tasktracker.map.tasks.maximum*? I see that the default is 2. But why? What are the pros and cons of increasing or decreasing this value?
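For reference, these properties go in conf/mapred-site.xml on each TaskTracker (the values below are just placeholders, not a recommendation):

```xml
<!-- conf/mapred-site.xml: per-TaskTracker slot limits (placeholder values) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```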


2 Answers


I don't think that there is a rule for that (like the rule for setting the number of reducers).

What I do is set the number of mappers and reducers per machine to the number of available cores minus one. Intuitively, this leaves each machine some capacity for other processes (like cluster communication), but I may be wrong. In any case, this is the only guidance I found, from "Pro Hadoop": it suggests using as many mappers as there are available cores, and one or two reducers. I hope it helps.
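As a sketch of that rule of thumb (the function name and the 8-core example are mine, not from the question):

```python
def slots_per_tasktracker(num_cores):
    """Heuristic from above: use one slot per core, but leave one core
    free for other processes (DataNode, TaskTracker daemon, cluster
    communication). Always keep at least one slot."""
    return max(1, num_cores - 1)

# Example: an 8-core slave would get 7 slots.
print(slots_per_tasktracker(8))
```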


Here is what I propose. Hope it helps!

  • Run "hadoop fsck /" on the master node to find out the number and size of the blocks. For example:

    ...
    Total size: 21600037259 B
    Total dirs: 78
    Total files:    152
    Total blocks (validated):   334 (avg. block size 64670770 B)
    ...
    
  • I set the number of map tasks to num_of_blocks / 10:
    set mapred.map.tasks=33;

  • I set the number of reduce tasks to block_size (in MB) * 2:
    set mapred.reduce.tasks=124;
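These two heuristics can be sketched like this; the calculation follows the `set` commands above (mapred.map.tasks=33, mapred.reduce.tasks=124), and the function name is mine:

```python
import re

def tasks_from_fsck(fsck_output):
    """Derive job task counts from 'hadoop fsck /' output:
    map tasks    = total blocks / 10
    reduce tasks = average block size in MB (rounded) * 2
    """
    blocks = int(re.search(r"Total blocks \(validated\):\s*(\d+)",
                           fsck_output).group(1))
    avg_block_size_b = int(re.search(r"avg\. block size (\d+) B",
                                     fsck_output).group(1))
    map_tasks = blocks // 10
    reduce_tasks = round(avg_block_size_b / 2**20) * 2  # bytes -> MB
    return map_tasks, reduce_tasks

sample = """Total size: 21600037259 B
Total dirs: 78
Total files: 152
Total blocks (validated): 334 (avg. block size 64670770 B)"""

print(tasks_from_fsck(sample))
```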

So far that's the best configuration I've found, but you'll have to adjust it according to your cluster's configuration.