From the official doc:
> The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
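In other words, you rarely set the map count directly; it falls out of the input split size, which by default tracks the DFS block size. A minimal sketch of the knobs involved, using the newer `mapreduce` API (the input/output paths and the 256 MB minimum split size are placeholder values, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-count-sketch");
        job.setJarByClass(MapCountSketch.class);
        job.setMapperClass(Mapper.class);               // identity mapper, just for illustration
        job.setInputFormatClass(TextInputFormat.class);

        // One map task runs per input split; by default the split size follows
        // the DFS block size. Raising the minimum split size lowers the number
        // of maps, lowering the maximum split size raises it.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB, placeholder

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```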
The ideal number of reducers is the value that gets each reduce task closest to:
- A multiple of the block size
- A task time between 5 and 15 minutes
- The fewest output files possible
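Unlike the map count, the reducer count is set explicitly on the job via `Job.setNumReduceTasks()`. A hedged sketch of how you might pick it, where `estimateReducers` and `configure` are hypothetical helpers and the shuffle-size estimate has to come from your own knowledge of the job:

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    // Pick a reducer count so each reducer handles roughly targetBytesPerReducer
    // of map output (e.g. a small multiple of the block size). Purely illustrative.
    static int estimateReducers(long estimatedShuffleBytes, long targetBytesPerReducer) {
        long n = (estimatedShuffleBytes + targetBytesPerReducer - 1) / targetBytesPerReducer;
        return (int) Math.max(1, n);
    }

    static void configure(Job job, long estimatedShuffleBytes) {
        long blockSize = 128L * 1024 * 1024;                     // placeholder: 128 MB block size
        int reducers = estimateReducers(estimatedShuffleBytes, 2 * blockSize);
        job.setNumReduceTasks(reducers);                         // the actual Hadoop API call
    }
}
```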
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
- Terrible performance on the next phase of the workflow
- Terrible performance due to the shuffle
- Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
- Destroying disk IO for no really sane reason
- Lots of network transfers