0
votes

My question is: does the MapReduce framework (for example, the Hadoop implementation) assign inputs to mappers before the map job starts, or is this done at runtime?

That is, assume I have some input i and machines m_1, m_2, ..., m_k. The machines need not be equally powerful; some may have better performance (CPU, memory) than others. If the master node splits the input among the mappers before the map tasks begin, or at least assigns the inputs to particular mapper nodes, the stronger machines could finish their work early and then sit idle. However, if the splitting is done at runtime, this problem does not arise.

I would also appreciate it if you could describe the overall split mechanism of MapReduce in the pre-mapper phase.


2 Answers

1
votes

Yes, in Hadoop the inputs for the map tasks are prepared before the map phase starts. The number of mappers is determined by the number of input splits calculated for the given input file before the map phase starts.

Here an input split is a logical block of the given input file. By default, one input split is prepared for every block of the file, and one mapper task is dispatched for every input split.

You can control the number of input splits by setting the mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize properties.
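
To make the effect of those two properties concrete, here is a minimal Python sketch of how the split size is commonly described as being derived: the block size clamped between the minimum and maximum split sizes. The function names are hypothetical and this is not Hadoop's code; it also ignores details such as non-splittable files and the handling of a small final remainder, so treat the counts as estimates.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Clamp the block size between the configured min and max split sizes."""
    return max(min_size, min(max_size, block_size))

def estimate_num_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Rough number of input splits (and so map tasks) for one splittable file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    # Integer ceiling division: a partial last chunk still needs its own split.
    return (file_size + split_size - 1) // split_size

# A 102400 MB file with 100 MB blocks and default min/max sizes.
print(estimate_num_splits(102400 * MB, 100 * MB))                     # 1024
# Raising split.minsize to 200 MB halves the number of map tasks.
print(estimate_num_splits(102400 * MB, 100 * MB, min_size=200 * MB))  # 512
# Lowering split.maxsize to 50 MB doubles it.
print(estimate_num_splits(102400 * MB, 100 * MB, max_size=50 * MB))   # 2048
```

In short, raising the minimum above the block size gives fewer, larger splits, and lowering the maximum below the block size gives more, smaller ones.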

How many of the calculated map tasks can execute at the same time depends on the capacity of your cluster.

For example, say your input file is about 100 GB (102400 MB) in size, the block size is 100 MB, and the input split size equals the block size (the default). Then 1024 map tasks will be calculated. Now assume that the maximum number of containers available to execute map/reduce tasks across your cluster is 500; then at best only 500 mappers will run in parallel. Whichever machine finishes its map task container first picks the next map task from the queue, and this continues until all mappers have completed.

Hope this helps.
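
The queue behaviour described here is what keeps faster machines from idling. The following is a toy Python simulation of it, not Hadoop's scheduler: the function name, the relative slot speeds and the uniform task cost are all assumptions made for illustration.

```python
import heapq

def simulate_pull_scheduling(num_tasks, slot_speeds, task_cost=1.0):
    """Simulate container slots pulling map tasks from a shared queue.

    slot_speeds holds a hypothetical relative speed for each slot.
    Returns (tasks completed per slot, total elapsed time).
    """
    # Min-heap of (time the slot becomes free, slot index); all start idle.
    free_at = [(0.0, i) for i in range(len(slot_speeds))]
    heapq.heapify(free_at)
    done = [0] * len(slot_speeds)
    finish = 0.0
    for _ in range(num_tasks):
        # The slot that becomes idle earliest takes the next queued task.
        now, slot = heapq.heappop(free_at)
        end = now + task_cost / slot_speeds[slot]
        done[slot] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, slot))
    return done, finish

# Two slots, the first twice as fast as the second, sharing 12 tasks.
done, finish = simulate_pull_scheduling(12, [2.0, 1.0])
print(done, finish)  # [8, 4] 4.0
```

The fast slot ends up running twice as many tasks and both finish together, whereas a fixed six-and-six assignment made up front would leave the fast slot idle for half of the run.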

1
votes

In the MapReduce framework, mapper tasks are assigned to machines based on the data locality concept. This means that, where possible, a data node that stores a block of the data is assigned to execute the mapper task for that block.

The data is split into blocks when you store it in HDFS, according to the configured replication factor and block size. So if the original file is, say, 128 MB and the block size is 64 MB, the file will be split into two blocks. These blocks will be stored on two different machines. Here is a quote from the HDFS design doc:

A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.

Now when you run a MapReduce job on that file, two mapper tasks will be launched on these two machines.

So the data split and the launching of mappers are two completely independent things. The first is handled by HDFS and the second by the MapReduce framework.
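
As a rough illustration of the locality preference, here is a simplified Python sketch. The function, block ids and node names are hypothetical, and a real scheduler weighs more factors (rack locality, for instance); the sketch only shows preferring a node that already holds the block and falling back to a remote read otherwise.

```python
def pick_task(free_node, pending, block_hosts):
    """Pick a pending map task for free_node, preferring a node-local block.

    pending: list of block ids still waiting for a mapper (mutated in place).
    block_hosts: dict mapping block id -> set of nodes holding a replica.
    Returns (block id or None, True if the task is data-local).
    """
    for block in pending:
        if free_node in block_hosts[block]:
            pending.remove(block)
            return block, True        # the node already stores this block
    if pending:
        return pending.pop(0), False  # no local block: read it over the network
    return None, False

# A 128 MB file with 64 MB blocks gives two blocks; the hosts are made up.
block_hosts = {"blk_0": {"node1"}, "blk_1": {"node2"}}
pending = ["blk_0", "blk_1"]
print(pick_task("node2", pending, block_hosts))  # ('blk_1', True)
print(pick_task("node3", pending, block_hosts))  # ('blk_0', False)
```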