
I am a bit confused about how exactly MapReduce works. I have read some articles but didn't get a proper answer.

Scenario:

I stored a 1 TB file on HDFS (let's say it is stored at /user/input/). Replication is 3 and the block size is 128 MB.

Now I want to analyze this 1 TB file using MapReduce. Since the block size is 128 MB, I will have 8192 blocks in total. Considering I have 100 machines in the cluster:

Will 8192 map tasks be spawned across all 100 nodes, evenly distributing the mappers? Or will they run only on the nodes where the replicated data is placed?


2 Answers


The number of mappers depends on the InputSplits, not on the replication factor.

Refer to the post below for the internals of InputSplits:

How does Hadoop process records split across block boundaries?

The number of mappers and reducers is decided by the Hadoop framework (the number of reducers can also be set explicitly).

Refer to the post below for more details:

Default number of reducers
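
For illustration, here is a minimal sketch using the standard org.apache.hadoop.mapreduce Job API (the output path and the reducer count are made up for the example): the number of reduce tasks can be set explicitly on the job, while the number of map tasks follows from the input splits of the input path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "analyze 1 TB input");

        // Map tasks: derived by the framework from the input splits of this path.
        FileInputFormat.addInputPath(job, new Path("/user/input/"));

        // Output path is hypothetical, just to make the job submittable.
        FileOutputFormat.setOutputPath(job, new Path("/user/output/"));

        // Reduce tasks: can be set explicitly (otherwise mapreduce.job.reduces applies).
        job.setNumReduceTasks(10);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```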

For simplicity's sake, assume that an HDFS block and an InputSplit are the same, with no record spanning multiple data nodes.

In your case, processing the 1 TB file requires 8192 map tasks. When starting a map task, the framework tries to run the mapper on a node where the data is present. The 8192 blocks of a 1 TB file may not be evenly distributed across 100 nodes. If they are evenly distributed, the framework will run map tasks on all 100 nodes. Data locality plays a key role in the selection of the node.
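
As a rough sketch of the arithmetic behind the 8192 figure (assuming the default FileInputFormat behaviour of roughly one split per HDFS block):

```java
public class SplitMath {
    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024 * 1024;  // 1 TB
        long blockSize = 128L * 1024 * 1024;         // 128 MB HDFS block

        // FileInputFormat-style split size: max(minSize, min(maxSize, blockSize)).
        // With the default min/max split sizes this is simply the block size.
        long minSize = 1L;
        long maxSize = Long.MAX_VALUE;
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        // One map task per split (ceiling division).
        long numMapTasks = (fileSize + splitSize - 1) / splitSize;
        System.out.println("map tasks ~= " + numMapTasks);  // prints 8192
    }
}
```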


The number of mappers to be run does not depend on the number of nodes or blocks or anything else; it depends only on the total number of input splits. In a database context, a split might correspond to a range of rows.

Now it is possible that an HDFS block is 128 MB while the input split size is 256 MB; in that case only one mapper will run over this input split, covering two blocks.

The next question is how input splits are created. These splits are created by the InputFormat class, whose getSplits() and createRecordReader() methods are responsible for creating splits and for reading records from them; you can override these methods if you want to change the way splits are created.
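
As a minimal skeleton (the class name is made up, and it simply reuses the default file-based behaviour), this is where getSplits() and createRecordReader() sit if you wanted to change how splits are created:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: delegates to the default file-based split logic,
// but these two overrides are the hooks for customizing split creation.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Default behaviour: roughly one split per HDFS block.
        // Override here to merge blocks, use row ranges, etc.
        return super.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Turns the byte range of a split into (key, value) records.
        return new LineRecordReader();
    }
}
```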

These mapper tasks are started on different nodes of the cluster, but there is no guarantee that they will be evenly distributed. MapReduce always tries to give a map task to a node that holds the data to be processed locally. If that is not possible, it gives the map task to the node with the best available resources.

Notice that input splits do not contain the actual data; they hold references to the data. These stored locations help MapReduce in assigning tasks.
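
A small sketch of that idea (a hypothetical helper that just prints what a split describes): an InputSplit only exposes a length and the hosts holding the data, plus, for file splits, the file path and byte offset, not the records themselves.

```java
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfo {
    // Prints the metadata a split carries; note there is no record data here.
    static void describe(InputSplit split) throws Exception {
        System.out.println("length (bytes): " + split.getLength());
        for (String host : split.getLocations()) {
            // The scheduler prefers running the mapper on one of these hosts.
            System.out.println("replica located on: " + host);
        }
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.println("file: " + fileSplit.getPath()
                    + ", start offset: " + fileSplit.getStart());
        }
    }
}
```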

I suggest you visit this link http://javacrunch.in/Yarn.jsp; it will give you an impression of how YARN handles job allocation. You can also visit http://javacrunch.in/MR.jsp for the internal workings of MapReduce.

Hope this solves your query.