1
votes

I am new to the MapReduce/Hadoop world. The configuration and documentation specify a number of mappers and reducers. What does that actually mean? My doubts are:

  1. Does it specify the number of levels of mapping/reducing that will be done? I.e., if the number of reducers is 2, will the reduce method be called 2 times?
  2. Does it specify the number of mapper/reducer threads working in parallel, where each one does map/reduce only once?

Which one is correct? Or does it mean something else? I am confused. Please help.

2 Answers

3
votes

No, neither interpretation is correct.

  1. Specifying the number of map tasks only gives a hint to the framework. The input format determines the number of input splits, and one split = one map task.
  2. The number of reduce tasks determines how many tasks the map output keys are divided among. Say you have 1000 different map output keys and 5 reduce tasks: each reduce task will then get approximately 200 keys. The reduce function is called once for each key, so approximately 200 times per reduce task in this example.
  3. The number of map tasks and reduce tasks doesn't say anything about parallelisation. The number of parallel threads per node is determined by the tasktracker, so you should specify the number of map and reduce slots that a tasktracker can run in parallel. This is configured with mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Note that a reducer (in a slot) performs one task at a time, so it is important to configure the number of reduce slots and the number of reduce tasks to match. If you have 10 reduce slots in total, you want at least 10 reduce tasks as well, or some slots will sit idle.
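
To make points 2 and 3 concrete, here is a minimal Python sketch that simulates them without Hadoop. It is only an illustration: the function names (assign_reduce_task, simulate_reduce_phase, idle_slots, waves) are made up for this example, and the CRC32 hash is a stand-in for Hadoop's default partitioner, which as far as I know takes the key's hashCode() modulo the number of reduce tasks.

```python
import math
import zlib
from collections import defaultdict


def assign_reduce_task(key, num_reduce_tasks):
    """Pick the reduce task for a key: stable hash modulo the task count.

    Assumption: CRC32 stands in for the key's hashCode() used by Hadoop.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def simulate_reduce_phase(map_output, num_reduce_tasks):
    """Return how many times reduce() would be called in each reduce task."""
    # grouped[task][key] collects every value emitted for that key
    grouped = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for key, value in map_output:
        grouped[assign_reduce_task(key, num_reduce_tasks)][key].append(value)
    # reduce() is called once per distinct key, not once per task
    return [len(keys) for keys in grouped]


def idle_slots(total_slots, num_tasks):
    """Slots left without work when there are fewer tasks than slots."""
    return max(0, total_slots - num_tasks)


def waves(num_tasks, total_slots):
    """Rounds needed when there are more tasks than slots."""
    return math.ceil(num_tasks / total_slots)


# 1000 distinct keys, each emitted 3 times by the mappers
map_output = [(f"key-{i}", 1) for i in range(1000)] * 3
calls_per_task = simulate_reduce_phase(map_output, 5)

print(calls_per_task)       # five counts, each roughly 200
print(sum(calls_per_task))  # 1000: one reduce() call per distinct key
print(idle_slots(10, 5))    # 5: ten reduce slots but only five reduce tasks
print(waves(25, 10))        # 3: 25 tasks on 10 slots run in three rounds
```

The per-task counts depend on the hash function, so they are only approximately equal. The total is always the number of distinct keys, whatever number of reduce tasks you choose.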

1
votes

As you are new to MapReduce, I strongly recommend reading this paper: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf

Most of your doubts will be cleared up once the paradigm is clearly understood, and it's the perfect starting point.