From the official doc:
> The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
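In other words, you rarely set the map count directly; it falls out of the input split size, which by default tracks the DFS block size. A minimal sketch of the knobs involved, using the newer `mapreduce` API (the input/output paths and the 256 MB minimum split size are placeholder values, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-count-sketch");
        job.setJarByClass(MapCountSketch.class);
        job.setMapperClass(Mapper.class);               // identity mapper, just for illustration
        job.setInputFormatClass(TextInputFormat.class);

        // One map task runs per input split; by default the split size follows
        // the DFS block size. Raising the minimum split size lowers the number
        // of maps, lowering the maximum split size raises it.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB, placeholder

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```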
The ideal number of reducers is the value that gets each reduce task closest to:
- A multiple of the block size
- A task time between 5 and 15 minutes
- The fewest output files possible
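Unlike the map count, the reducer count is set explicitly on the job via `Job.setNumReduceTasks()`. A hedged sketch of how you might pick it, where `estimateReducers` and `configure` are hypothetical helpers and the shuffle-size estimate has to come from your own knowledge of the job:

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    // Pick a reducer count so each reducer handles roughly targetBytesPerReducer
    // of map output (e.g. a small multiple of the block size). Purely illustrative.
    static int estimateReducers(long estimatedShuffleBytes, long targetBytesPerReducer) {
        long n = (estimatedShuffleBytes + targetBytesPerReducer - 1) / targetBytesPerReducer;
        return (int) Math.max(1, n);
    }

    static void configure(Job job, long estimatedShuffleBytes) {
        long blockSize = 128L * 1024 * 1024;                     // placeholder: 128 MB block size
        int reducers = estimateReducers(estimatedShuffleBytes, 2 * blockSize);
        job.setNumReduceTasks(reducers);                         // the actual Hadoop API call
    }
}
```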
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
- Terrible performance on the next phase of the workflow
- Terrible performance due to the shuffle
- Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
- Destroying disk IO for no really sane reason
- Lots of network transfers