0 votes

I've noticed that all map and reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block, which resides on node2. When I run a MapReduce job whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?

I have a 3-node cluster running on KVMs created by following the Cloudera CDH4 distributed installation guide.
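For reference, block placement can be checked with the HDFS Java API, roughly like this (a minimal sketch; the class name and the path /user/test/onefile.txt are placeholders, not the actual file I used):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder path; substitute the file in question
            Path file = new Path("/user/test/onefile.txt");
            FileStatus status = fs.getFileStatus(file);

            // One BlockLocation per HDFS block; getHosts() lists the datanodes
            // holding a replica of that block
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }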

2
How do you know your block resides on node2? Which scheduler are you using? - cabad

2 Answers

1 vote

"I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data."

Well, there might be an exceptional case:

If the node holding the data block doesn't have any free CPU (map) slots, it won't be able to start a mapper on that particular node. In such a scenario, instead of waiting, the data block will be moved to a nearby node and processed there. But before that, the framework will try to schedule the task on another node that holds a replica of that block, so the processing stays local (if the replication factor is > 1).
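If you want to confirm how the maps were actually placed, the job counters report data-local versus rack-local maps. A rough sketch, assuming the MRv2 API; the printLocality helper is just illustrative:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class LocalityCheck {
        // Call after job.waitForCompletion(true): reports how many map tasks
        // ran on a node holding their input block vs. only on the same rack
        public static void printLocality(Job job) throws Exception {
            Counters counters = job.getCounters();
            long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            System.out.println("data-local maps: " + dataLocal
                    + ", rack-local maps: " + rackLocal);
        }
    }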

HTH

-1 votes

I don't understand what you mean by "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" a Hadoop cluster to store a particular block on a specific node.

Hadoop decides the number of mappers based on the input size. If the input size is less than the HDFS block size (the default, I think, is 64 MB), it will spawn just one mapper.

You can set the job parameter "mapred.max.split.size" to whatever size you want in order to force multiple mappers to be spawned (the default should suffice in most cases).
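Roughly like this with the new API (a sketch; the class name, the input path, and the 16 MB value are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Placeholder input path
            FileInputFormat.addInputPath(job, new Path("/user/test/input"));

            // Cap each input split at 16 MB so a larger input file is divided
            // across several map tasks; this is the programmatic equivalent of
            // the mapred.max.split.size property mentioned above
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

            // ... set mapper/reducer classes, output path, etc., then submit
        }
    }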