
I am a bit confused about how exactly MapReduce works. I have read some articles but didn't get a proper answer.

Scenario:

I stored a 1 TB file on HDFS (let's say it is stored at /user/input/). Replication is 3 and the block size is 128 MB.

Now I want to analyze this 1 TB file using MapReduce. Since the block size is 128 MB, I will have 8192 blocks in total. Considering I have 100 machines in the cluster:

Will 8192 map tasks be spawned across all 100 nodes, evenly distributing the mappers? Or will they run only on the nodes where the replicated data is placed?


2 Answers


The number of mappers depends on the InputSplits, not on the replication factor.

Refer to the post below for the internals of InputSplits:

How does Hadoop process records split across block boundaries?

The number of mappers and reducers is decided by the Hadoop framework (the number of reducers can also be set explicitly).

Refer to the post below for more details:

Default number of reducers
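
For illustration, here is a minimal sketch using the standard org.apache.hadoop.mapreduce Job API (the output path and the reducer count are made up for the example): the number of reduce tasks can be set explicitly on the job, while the number of map tasks follows from the input splits of the input path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "analyze 1 TB input");

        // Map tasks: derived by the framework from the input splits of this path.
        FileInputFormat.addInputPath(job, new Path("/user/input/"));

        // Output path is hypothetical, just to make the job submittable.
        FileOutputFormat.setOutputPath(job, new Path("/user/output/"));

        // Reduce tasks: can be set explicitly (otherwise mapreduce.job.reduces applies).
        job.setNumReduceTasks(10);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```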

For simplicity's sake, assume that an HDFS block and an InputSplit are the same, with no record spanning multiple data nodes.

In your case, processing the 1 TB file requires 8192 map tasks. When starting a map task, the framework tries to run the mapper on a node where the data is present. The 8192 blocks of a 1 TB file may not be evenly distributed across 100 nodes. If they are evenly distributed, the framework will run map tasks on all 100 nodes. Data locality plays a key role in the selection of the node.
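
As a rough sketch of the arithmetic behind the 8192 figure (assuming the default FileInputFormat behaviour of roughly one split per HDFS block):

```java
public class SplitMath {
    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024 * 1024;  // 1 TB
        long blockSize = 128L * 1024 * 1024;         // 128 MB HDFS block

        // FileInputFormat-style split size: max(minSize, min(maxSize, blockSize)).
        // With the default min/max split sizes this is simply the block size.
        long minSize = 1L;
        long maxSize = Long.MAX_VALUE;
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        // One map task per split (ceiling division).
        long numMapTasks = (fileSize + splitSize - 1) / splitSize;
        System.out.println("map tasks ~= " + numMapTasks);  // prints 8192
    }
}
```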


The number of mappers to be run does not depend on the number of nodes or blocks or anything else; it depends only on the total number of input splits. In a database context, a split might correspond to a range of rows.

Now it is possible that an HDFS block is 128 MB while the input split size is 256 MB; in that case only one mapper will run over this input split, covering two blocks.

The next question is how input splits are created. These splits are created by the InputFormat class, whose getSplits() and createRecordReader() methods are responsible for creating splits and for reading records from them; you can override these methods if you want to change the way splits are created.
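
As a minimal skeleton (the class name is made up, and it simply reuses the default file-based behaviour), this is where getSplits() and createRecordReader() sit if you wanted to change how splits are created:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: delegates to the default file-based split logic,
// but these two overrides are the hooks for customizing split creation.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Default behaviour: roughly one split per HDFS block.
        // Override here to merge blocks, use row ranges, etc.
        return super.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Turns the byte range of a split into (key, value) records.
        return new LineRecordReader();
    }
}
```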

These mapper tasks are started on different nodes of the cluster, but there is no guarantee that they will be evenly distributed. MapReduce always tries to give a map task to a node that holds the data to be processed locally. If that is not possible, it gives the map task to the node with the best available resources.

Notice that input splits do not contain the actual data; they hold references to the data. These stored locations help MapReduce in assigning tasks.
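
A small sketch of that idea (a hypothetical helper that just prints what a split describes): an InputSplit only exposes a length and the hosts holding the data, plus, for file splits, the file path and byte offset, not the records themselves.

```java
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfo {
    // Prints the metadata a split carries; note there is no record data here.
    static void describe(InputSplit split) throws Exception {
        System.out.println("length (bytes): " + split.getLength());
        for (String host : split.getLocations()) {
            // The scheduler prefers running the mapper on one of these hosts.
            System.out.println("replica located on: " + host);
        }
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.println("file: " + fileSplit.getPath()
                    + ", start offset: " + fileSplit.getStart());
        }
    }
}
```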

I suggest you visit this link http://javacrunch.in/Yarn.jsp; it will give you an impression of how YARN handles job allocation. You can also visit http://javacrunch.in/MR.jsp for the internal workings of MapReduce.

Hope this solves your query.