I am new to Hadoop and the MapReduce model and am trying to get the concepts right.
First, I would like to understand input splits and the number of mappers correctly.
I am running the MapReduce WordCount program, and these are my questions.
1) How are the input splits determined? I ran the same program on the same cluster with two inputs of different sizes.
file 1: size 48 MB => I got number of splits: 1 in the log.
file 2: size 126 MB => number of splits: 1
file 2: size 126 MB (executed in the Eclipse IDE) => number of splits: 4
Shouldn't the number of splits be 2 for the 126 MB file? I have read that the block size is 64 MB, so I would expect it to create 2 splits.
2) How is the number of mappers determined? To understand the MapReduce workflow, I am trying to get the number of mappers with the following line.
conf.get("mapred.map.tasks")
It returns 2 every time.
3) Is there any relation between the number of splits and the number of mappers?
4) Do the above things depend on the cluster? Are they the same in pseudo-distributed mode and on other clusters, or different?
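To make my expectation in question 1 concrete, here is the arithmetic I have in mind. It is only a sketch of my own assumption (one split per 64 MB block, plus one for any remainder), not necessarily how Hadoop actually computes splits; the class and method names are made up for illustration.

```java
public class SplitEstimate {
    // Naive expectation: one split per full block, plus one for a partial block.
    // This is my assumption, not a claim about Hadoop's real split logic.
    public static long expectedSplits(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division without floating point.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb; // the block size I have read about

        System.out.println("48 MB  -> " + expectedSplits(48 * mb, blockSize));
        System.out.println("126 MB -> " + expectedSplits(126 * mb, blockSize));
    }
}
```

By this reasoning I would expect 1 split for the 48 MB file and 2 splits for the 126 MB file, which matches neither the 1 nor the 4 that I actually see for the 126 MB file.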
Thank you.