Controlling and monitoring the number of simultaneous map/reduce tasks in YARN

Question

I have a Hadoop 2.2 cluster deployed on a small number of powerful machines. I am required to use YARN as the framework, which I am not very familiar with.

How do I control the number of actual map and reduce tasks that will run in parallel? Each machine has many CPU cores (12-32) and enough RAM. I want to utilize them maximally.
How can I verify that my settings actually led to better utilization of the machines? Where can I check how many cores (threads, processes) were used during a given job?

Thanks in advance for helping me melt these machines :)

Jasper Jasper · Accepted Answer · 2014-02-27T14:00:12

1.
In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had.

These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which set how much memory and how many virtual cores each node offers. Both resources are shared by maps and reduces alike.

Essentially:
YARN has no TaskTrackers, only generic NodeManagers. Hence, there is no longer a separation between map slots and reduce slots. Everything depends on the amount of memory in use and requested.
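
To make the memory-driven model concrete, here is a rough sketch of the arithmetic, not an exact model of the scheduler. It assumes the scheduler considers memory only and rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb. That property, the per-task request sizes (set through mapreduce.map.memory.mb and mapreduce.reduce.memory.mb) and all the numbers are illustrative and not part of the answer above. The function names are made up for the example, and the ApplicationMaster's own container is ignored.

```python
import math


def round_up_to_multiple(request_mb, min_allocation_mb):
    """Round a container request up to a multiple of the minimum allocation
    (assumed scheduler behaviour; check your scheduler's documentation)."""
    return max(1, math.ceil(request_mb / min_allocation_mb)) * min_allocation_mb


def max_parallel_tasks(node_memory_mb, task_memory_mb, min_allocation_mb=1024):
    """Estimate how many task containers fit on one node by memory alone.

    node_memory_mb    -- value of yarn.nodemanager.resource.memory-mb
    task_memory_mb    -- memory requested per map or reduce task (assumed)
    min_allocation_mb -- yarn.scheduler.minimum-allocation-mb (assumed)
    """
    container_mb = round_up_to_multiple(task_memory_mb, min_allocation_mb)
    return node_memory_mb // container_mb


if __name__ == "__main__":
    node_mb = 49152  # hypothetical node offering 48 GB to YARN
    print("maps per node:   ", max_parallel_tasks(node_mb, 2048))
    print("reduces per node:", max_parallel_tasks(node_mb, 4096))
```

With these made-up numbers a node runs at most 24 map containers or 12 reduce containers at once. To get more parallelism per machine, you would either raise the memory offered to YARN or lower the per-task request.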

2.

Using the web UI you can get a lot of monitoring and admin information:

NameNode - http://:50070/
Resource Manager - http://:8088/
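
The Resource Manager also serves a REST API on the same port, so the numbers shown in its web UI can be polled while a job runs. The sketch below assumes the /ws/v1/cluster/metrics endpoint and the field names totalMB, allocatedMB and containersAllocated as I recall them from the ResourceManager REST documentation. Please verify them against your Hadoop version. The host name passed to fetch_cluster_metrics is a placeholder you have to supply, and both function names are made up for the example.

```python
import json
from urllib.request import Request, urlopen


def fetch_cluster_metrics(rm_host, port=8088):
    """Fetch cluster-wide metrics from the Resource Manager as a dict.

    rm_host is a placeholder for your Resource Manager host name.
    """
    url = "http://%s:%d/ws/v1/cluster/metrics" % (rm_host, port)
    # Ask for JSON explicitly; the API can also return XML.
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(payload):
    """Reduce the metrics payload to a few utilization figures."""
    metrics = payload["clusterMetrics"]
    total_mb = metrics["totalMB"]
    allocated_mb = metrics["allocatedMB"]
    used_pct = 100.0 * allocated_mb / total_mb if total_mb else 0.0
    return {
        "containers": metrics["containersAllocated"],
        "allocated_mb": allocated_mb,
        "total_mb": total_mb,
        "memory_used_pct": round(used_pct, 1),
    }


if __name__ == "__main__":
    import sys

    # Usage: python rm_metrics.py <resource-manager-host>
    print(summarize(fetch_cluster_metrics(sys.argv[1])))
```

Running this every few seconds during a job shows how many containers are allocated and how much of the memory offered to YARN is actually in use, which is a reasonable first check that a settings change had an effect.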

In addition, Apache Ambari is meant for this kind of monitoring: http://ambari.apache.org/

Hue is also useful for interfacing with the Hadoop/YARN cluster in many ways: http://gethue.com/
