
I am new to Hadoop and the MapReduce model and am trying to get the concepts right.

First, I would like to understand input splits and the number of mappers.

I am running the MapReduce wordcount program, and I have the following questions.

1) How are input splits determined? I ran the same program on the same cluster with inputs of two different sizes:

file 1: size 48 MB  => number of splits: 1 (from the log)
file 2: size 126 MB => number of splits: 1
file 2: size 126 MB (executed in the Eclipse IDE) => number of splits: 4

Shouldn't the number of splits be 2 for the 126 MB file? I have read that the block size is 64 MB, so it should create 2 splits.
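The expectation behind that question is simple arithmetic: if the split size equals a 64 MB block size, a 126 MB file should be cut into ceil(126 / 64) = 2 splits. A minimal sketch of that calculation:

```java
public class ExpectedSplits {
    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long fileSize = 126 * MB;
        long blockSize = 64 * MB;
        // Integer ceiling division: ceil(126 / 64) = 2
        long splits = (fileSize + blockSize - 1) / blockSize;
        System.out.println(splits); // 2
    }
}
```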

2) How is the number of mappers determined? To understand the MapReduce workflow, I am trying to get the number of mappers with the following line:

conf.get("mapred.map.tasks")

It returns 2 every time.

3) Is there any relation between the number of splits and the number of mappers?

4) Do the above depend on the cluster? Is it the same for pseudo-distributed mode as for other clusters, or different?

Thank you.

1 Answer


In MapReduce, the InputFormat class is responsible for providing the split information. An input split is the amount of data that goes into one map task.

  1. From Hadoop 2.4 the default block size is 128 MB, hence you are seeing 1 split for the 126 MB file.
  2. The number of mappers is determined by the number of splits for the input path. For example, if you are processing a directory that has 10 files and each file produces 10 splits, your job would require 100 mappers to process the data.
  3. Yes. As noted above, in most cases the number of splits equals the number of mappers. The exception is input that Hadoop cannot split: for compressed file formats like gzip, which are not splittable, the number of files equals the number of mappers.
  4. No, it's the same for pseudo-distributed and cluster modes.
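To make point 1 concrete: FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)) and then cuts each file into roughly ceil(fileSize / splitSize) splits. Below is a self-contained sketch of that arithmetic (no Hadoop dependency; the 32 MB local block size used for the Eclipse run is an assumption to illustrate how 4 splits could arise):

```java
public class SplitCountSketch {
    // Mirrors FileInputFormat's split-size rule: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // A file of fileSize bytes yields roughly ceil(fileSize / splitSize) splits
    // (Hadoop lets the last split grow up to 1.1x splitSize; ignored here).
    static long countSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long minSize = 1L, maxSize = Long.MAX_VALUE;

        // 126 MB file, 128 MB default block size (Hadoop 2.4+) -> 1 split
        System.out.println(countSplits(126 * MB, computeSplitSize(128 * MB, minSize, maxSize))); // 1

        // Same file with the older 64 MB default -> 2 splits
        System.out.println(countSplits(126 * MB, computeSplitSize(64 * MB, minSize, maxSize))); // 2

        // Hypothetical 32 MB block size in the local/Eclipse run -> 4 splits
        System.out.println(countSplits(126 * MB, computeSplitSize(32 * MB, minSize, maxSize))); // 4
    }
}
```

With the defaults, the split size collapses to the block size, which is why the split count tracks the block size so directly.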

More information:

  1. Default split size and changing split size
  2. How are splits calculated
  3. Record splits across block boundaries