0 votes

I've noticed that all map and reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block, which resides on node2. When I run a MapReduce job whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?

I have a 3-node cluster running on KVMs created by following the Cloudera CDH4 distributed installation guide.
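For reference, block placement can be checked with the HDFS Java API, roughly like this (a minimal sketch; the class name and the path /user/test/onefile.txt are placeholders, not the actual file I used):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder path; substitute the file in question
            Path file = new Path("/user/test/onefile.txt");
            FileStatus status = fs.getFileStatus(file);

            // One BlockLocation per HDFS block; getHosts() lists the datanodes
            // holding a replica of that block
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }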

2
How do you know your block resides on node2? Which scheduler are you using? - cabad

2 Answers

1 vote

"I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data."

Well, there might be an exceptional case:

If the node holding the data block doesn't have any free CPU (map) slots, it won't be able to start a mapper on that particular node. In such a scenario, instead of waiting, the data block will be moved to a nearby node and processed there. But before that, the framework will try to schedule the task on another node that holds a replica of that block, so the processing stays local (if the replication factor is > 1).
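If you want to confirm how the maps were actually placed, the job counters report data-local versus rack-local maps. A rough sketch, assuming the MRv2 API; the printLocality helper is just illustrative:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class LocalityCheck {
        // Call after job.waitForCompletion(true): reports how many map tasks
        // ran on a node holding their input block vs. only on the same rack
        public static void printLocality(Job job) throws Exception {
            Counters counters = job.getCounters();
            long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            System.out.println("data-local maps: " + dataLocal
                    + ", rack-local maps: " + rackLocal);
        }
    }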

HTH

-1 votes

I don't understand what you mean by "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" a Hadoop cluster to store a particular block on a specific node.

Hadoop decides the number of mappers based on the input size. If the input size is less than the HDFS block size (the default, I think, is 64 MB), it will spawn just one mapper.

You can set the job parameter "mapred.max.split.size" to whatever size you want in order to force multiple mappers to be spawned (the default should suffice in most cases).
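Roughly like this with the new API (a sketch; the class name, the input path, and the 16 MB value are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Placeholder input path
            FileInputFormat.addInputPath(job, new Path("/user/test/input"));

            // Cap each input split at 16 MB so a larger input file is divided
            // across several map tasks; this is the programmatic equivalent of
            // the mapred.max.split.size property mentioned above
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);

            // ... set mapper/reducer classes, output path, etc., then submit
        }
    }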