I read an article about MapReduce but I am still confused about how a job is split into tasks (in detail) to take advantage of parallel processing, especially in cases like this: assume that after the map phase we have 100 million records (key/value pairs) with 5 keys, namely 'key1', 'key2', 'key3', 'key4', 'key5'. The first key has 99 million records, and the remaining keys have 0.25 million each. If we have 3 workers to do the reduce tasks, how does the Master split the job? I have read that each key is processed by only one reducer, so if one reducer has to process 'key1', wouldn't it do far more work than the others, meaning that the parallelism among reducers doesn't help much in this case?
2 Answers
The MapReduce technique comes with several default assumptions:
That the tasks are not inter-dependent, i.e. you don't have to run task1 first, then feed its output into task2, and so on.
That the job can be divided into tasks that are "similar" in the processing power and time they need. Your example is an extreme violation of this assumption, which is why MapReduce doesn't help much there (see the sketch after this list).
That a sensible divide strategy exists, and that applying it doesn't take more time than running the tasks themselves.
That the work which can be parallelized is the major part of the job, and that it doesn't depend on some serial / single resource, e.g. disk IO.
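To make the second assumption concrete with the numbers from the question, here is a back-of-the-envelope sketch (plain Python, not Hadoop code). It assumes every record costs the same to reduce and that each key goes entirely to one reducer, which is the classic behavior:

```python
# Back-of-the-envelope sketch: how long do 3 reducers take when whole keys
# are assigned to single reducers, using the skewed counts from the question?
# "1 time unit per record" is an assumption for illustration only.

records_per_key = {
    "key1": 99_000_000,
    "key2": 250_000,
    "key3": 250_000,
    "key4": 250_000,
    "key5": 250_000,
}

num_reducers = 3

# Classic MapReduce: every record of a key goes to exactly one reducer;
# here keys are handed out greedily to the currently least-loaded reducer.
loads = [0] * num_reducers
for key, count in sorted(records_per_key.items(), key=lambda kv: -kv[1]):
    target = loads.index(min(loads))
    loads[target] += count

total = sum(records_per_key.values())
wall_clock = max(loads)        # the job finishes when the slowest reducer finishes
ideal = total / num_reducers   # perfectly balanced load

print(f"reducer loads:          {loads}")
print(f"wall clock:             {wall_clock:,} record-times")
print(f"ideal balanced load:    {ideal:,.0f} record-times")
print(f"speedup over 1 reducer: {total / wall_clock:.2f}x (vs {num_reducers}x ideal)")
```

With these numbers the reducer holding 'key1' dominates: the job takes about 99 million record-times instead of the ideal ~33 million, so the three reducers give barely any speedup over one.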
In reality a lot of problems do satisfy the four points above (and of course a lot don't, which is why MapReduce isn't a universal solution). Common examples include problems with a large number of input records, each needing separate and computationally expensive processing, while the total input size stays small. E.g.
Determining whether a line intersects a 3D structure, where you may have a huge number of triangle faces and run an intersection test against each triangle (see the sketch below)
Pricing a large number of financial products
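A minimal sketch of the first example (plain Python with a process pool standing in for the workers; the ray/triangle test and the toy geometry are illustrative assumptions, not code from any MapReduce framework). Each per-triangle test is an independent "map" task, and combining the boolean results is a cheap "reduce":

```python
from functools import reduce
from multiprocessing import Pool

EPS = 1e-9

def sub(a, b):   return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def dot(a, b):   return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def ray_hits_triangle(args):
    """Moller-Trumbore ray/triangle intersection; one independent map task."""
    (origin, direction), (v0, v1, v2) = args
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < EPS:                      # ray parallel to the triangle's plane
        return False
    inv = 1.0 / det
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    return dot(e2, q) * inv >= EPS          # intersection lies in front of the origin

if __name__ == "__main__":
    ray = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))          # toy ray along +z
    triangles = [((-1.0, -1.0, 5.0), (1.0, -1.0, 5.0), (0.0, 1.0, 5.0)),
                 ((10.0, 10.0, 5.0), (11.0, 10.0, 5.0), (10.0, 11.0, 5.0))]
    with Pool() as pool:                               # "map": independent per-triangle tests
        hits = pool.map(ray_hits_triangle, [(ray, tri) for tri in triangles])
    any_hit = reduce(lambda a, b: a or b, hits, False) # "reduce": combine the booleans
    print(any_hit)                                     # True: the first triangle is hit
```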
Hope the above helps.
Input data with the same key doesn't necessarily all have to be handled by one reducer; several reducers can share the records for a single key.
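Assuming the reduce function is an associative aggregation (e.g. a count), one way this can work in practice is to split the hot key into sub-keys so its records spread over reducers, then merge the partial results in a much smaller second pass. A minimal sketch in plain Python (not a real MapReduce framework; the key names and counts are illustrative):

```python
import random
from collections import defaultdict

NUM_REDUCERS = 3

def salt(key, num_shards=NUM_REDUCERS):
    """Split a hot key into sub-keys so its records spread over reducers."""
    return f"{key}#{random.randrange(num_shards)}"

# Toy map output: one skewed key plus a couple of small ones, each record counted as 1.
map_output = [(salt(k) if k == "key1" else k, 1)
              for k in ["key1"] * 9_000 + ["key2"] * 25 + ["key3"] * 25]

# First reduce pass: each (sub-)key is handled by exactly one reducer.
partition = defaultdict(list)
for key, value in map_output:
    partition[key].append(value)
first_pass = {key: sum(values) for key, values in partition.items()}

# Second, much smaller reduce pass: merge the partial counts of each salted key.
final = defaultdict(int)
for key, count in first_pass.items():
    final[key.split("#")[0]] += count

print(dict(final))   # e.g. {'key1': 9000, 'key2': 25, 'key3': 25}
```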
Take merge sort as an example. The map step divides an array into several sub-arrays, and multiple layers of reduce jobs sort and merge those sub-arrays back into one array. No matter how the data is arranged, the complexity is still O(n log n); for merge sort the best, worst, and average cases are the same, because the way the algorithm divides and merges the array is not affected by how the data is arranged.
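A minimal sketch of that picture (plain Python; function names are illustrative): a "map" phase splits and sorts chunks, then repeated "reduce" rounds merge pairs of sorted chunks until one sorted array remains.

```python
from heapq import merge

def map_phase(data, num_chunks):
    """Split the input into num_chunks sub-arrays and sort each one."""
    size = max(1, (len(data) + num_chunks - 1) // num_chunks)
    return [sorted(data[i:i + size]) for i in range(0, len(data), size)]

def reduce_round(chunks):
    """One layer of reduce: merge neighbouring pairs of sorted chunks."""
    return [list(merge(chunks[i], chunks[i + 1])) if i + 1 < len(chunks) else chunks[i]
            for i in range(0, len(chunks), 2)]

def merge_sort_mapreduce(data, num_chunks=4):
    chunks = map_phase(data, num_chunks)
    while len(chunks) > 1:          # multiple layers of reduce jobs
        chunks = reduce_round(chunks)
    return chunks[0] if chunks else []

print(merge_sort_mapreduce([5, 3, 9, 1, 4, 8, 2, 7, 6, 0]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```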