Hadoop: check how many mapper nodes actually ran

Question

I'm running a MapReduce program with different numbers of mappers and reducers to test how the execution time changes. I've gotten to the point where I can set the split size to change the number of mappers, and I'm seeing some changes in execution time. I'm using a remote machine (quad-core with hyper-threading). Hadoop version: 1.2.1. Input file size: 1 GB.

What I want to do now is verify that the MR job really ran as I configured it.

For example, I set the split size to about 250 MB so that I have four mappers. The job history file (_logs/history/job....) says:

TOTAL MAP TASKS = 4
LAUNCHED MAP TASKS = 4
FINISHED MAP TASKS = 4
DATA-LOCAL MAP TASKS = 1

(1) In this case, can I say that four cores (four mappers) were used?

(2) When I run top, I see only two Java processes and two Python processes (the MR program is written in Python). Even when I expect 4 or 8 mappers, I always see only two Java processes. Does that mean I'm not utilizing the other cores?

If you have one remote machine with 4 cores, that means one node, not four. — alko

alko · Accepted Answer · 2013-12-04T18:13:29

(1, 2) TOTAL MAP TASKS does not reflect parallel or serial execution. It is the total number of tasks processed, so if you see two Java processes, your tasks are being executed two at a time.

Split size controls the number of map tasks generated, but each node can run an effectively unlimited number of map tasks over the life of a job, with a predefined cap on how many run simultaneously. That cap is an upper bound: not all of those mappers are necessarily running at any given moment, because there is some waiting time from JobTracker interaction and other overhead.
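To make the "tasks vs. simultaneous tasks" distinction concrete, here is a rough back-of-the-envelope sketch in Python. The function name map_waves is made up for illustration, and the slot count of 2 is an assumption inferred from the two Java processes seen in top. It ignores scheduling delays, so treat the result as a lower bound on the number of rounds.

```python
import math

def map_waves(total_map_tasks, concurrent_slots):
    """Minimum number of rounds ("waves") needed to run all map tasks
    when at most `concurrent_slots` of them can run at the same time.
    Hypothetical helper for illustration; not part of Hadoop."""
    if concurrent_slots < 1:
        raise ValueError("concurrent_slots must be at least 1")
    return math.ceil(total_map_tasks / concurrent_slots)

# 4 map tasks (as in the question), assumed 2 running at a time
print(map_waves(4, 2))   # 2
# 14 map tasks, still 2 at a time
print(map_waves(14, 2))  # 7
# 4 map tasks with 4 concurrent slots: everything fits in one wave
print(map_waves(4, 4))   # 1
```

So a job can report 4 (or 14) total map tasks even though no more than two of them were ever running at once.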

You can control the number of concurrently running mappers per node with the mapred.tasktracker.map.tasks.maximum parameter. You'll probably also need to adjust the JVM memory settings in order to add more mappers. Up to mapred.tasktracker.map.tasks.maximum mapper processes (separate JVM instances) will be launched, and if that number equals the number of cores, they will usually utilize all the cores. Note that it is the OS that schedules processes across cores, and it's up to the OS to perform load balancing and performance optimization.
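If you want to check what that parameter is currently set to, something like the following Python sketch may help. It parses text in the usual Hadoop configuration layout (property / name / value). The sample XML and the value 4 are made up for illustration, read_property is a hypothetical helper, and the fallback of "2" is, as far as I know, the Hadoop 1.x default when the property is not set. On a real node you would read the contents of your own mapred-site.xml instead of the sample string.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; on a real node, read your mapred-site.xml instead.
SAMPLE_MAPRED_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
"""

def read_property(xml_text, key, default=None):
    """Return the value of `key` from Hadoop-style configuration XML,
    or `default` if the property is not present.
    Hypothetical helper for illustration; not part of Hadoop."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return default

KEY = "mapred.tasktracker.map.tasks.maximum"
# Fallback "2" is assumed to be the Hadoop 1.x default.
slots = int(read_property(SAMPLE_MAPRED_SITE, KEY, default="2"))
print(slots)  # 4
```

With the value left unset, a fallback of 2 would match the two Java processes seen in top.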

Note, however, that for map tasks I/O is often the bottleneck, not CPU, so parallel execution does not necessarily lead to a speedup on a single machine, unless you have a sophisticated RAID configuration.

(3) If TOTAL MAP TASKS is 14, then your job was actually split into 14 parts.
