
When running a MapReduce job on a certain file in Hadoop, it sometimes creates 1 map task and 1 reduce task, while for another file it can use 4 map tasks and 1 reduce task.

My question is: on what basis is the number of map and reduce tasks decided?

Is there a certain amount of input per map/reduce task after which a new task is created?

Many thanks, folks.


2 Answers


From the official doc:

The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.

The ideal reducers should be the optimal value that gets them closest to:

  • A multiple of the block size
  • A task time between 5 and 15 minutes
  • Creates the fewest files possible

Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:

  • Terrible performance on the next phase of the workflow
  • Terrible performance due to the shuffle
  • Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
  • Destroying disk IO for no really sane reason
  • Lots of network transfers
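
To make the map-side point concrete, here is a minimal driver sketch showing how the input split size settings can be used to influence how many map tasks a job gets. This is an illustrative identity job, not code from the question or the docs; the class name and split sizes are arbitrary example values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split size example");
            job.setJarByClass(SplitSizeExample.class);

            // The number of map tasks is driven by the number of input splits.
            // By default one split is roughly one HDFS block, so a 512 MB file
            // with a 128 MB block size yields about 4 map tasks. Raising the
            // minimum split size merges blocks into fewer, larger splits
            // (fewer maps); lowering the maximum split size creates more,
            // smaller splits (more maps).
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
            FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB

            // No mapper/reducer classes are set, so the identity Mapper and
            // Reducer are used; the output types match what TextInputFormat
            // feeds them (byte offset, line of text).
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same knobs are exposed as the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties, so they can also be supplied at submission time if the driver parses generic options.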
0
votes

The number of mappers is equal to the number of input splits for the input file being processed, which by default corresponds to the number of HDFS blocks. The number of reducers should ideally be about 10% of your total mappers: if you have 100 mappers, then the number of reducers should be somewhere around 10. However, it is possible to specify the number of reducers explicitly in your MapReduce job, as shown in the sketch below.
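
Setting the reducer count explicitly is a single call in the driver. A minimal sketch, again using an identity job, with the ~10% rule of thumb supplying an arbitrary example value of 10:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "reducer count example");
            job.setJarByClass(ReducerCountExample.class);

            // The number of reduce tasks is never derived from the input size;
            // it is whatever the job requests. A job expected to run roughly
            // 100 mappers might ask for 10 reducers.
            job.setNumReduceTasks(10);

            // Identity Mapper/Reducer defaults; output types match what
            // TextInputFormat produces.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Equivalently, the value can be passed at submission time with -D mapreduce.job.reduces=10, provided the driver parses generic options (for example via ToolRunner).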