I read an article about MapReduce but I am still confused about how a job is split into tasks (in detail) to take advantage of parallel processing, especially in cases like this: assume that after the map phase we have 100 million records (key/value pairs) with 5 keys, namely 'key1', 'key2', 'key3', 'key4', 'key5'. The first key has 99 million records, and the remaining keys have 0.25 million each. If we have 3 workers to do the reduce tasks, how does the Master split the job? I have read that each key is processed by only one reducer, so if one reducer has to process 'key1', wouldn't it do far more work than the others, meaning that the parallelism among reducers doesn't help much in this case?
2 Answers
The MapReduce technique comes with several default assumptions:
That the tasks are not inter-dependent, i.e. you don't have to run task1 first, then feed its output into task2, and so on.
That the job can be divided into tasks that are "similar" in the processing power and time they need. Your example is an extreme violation of this assumption, which is why MapReduce doesn't help much there (see the sketch after this list).
That a sensible divide strategy exists, and that applying it doesn't take more time than running the tasks themselves.
That the work which can be parallelized is the major part of the job, and that it doesn't depend on some serial / single resource, e.g. disk IO.
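To make the second assumption concrete with the numbers from the question, here is a back-of-the-envelope sketch (plain Python, not Hadoop code). It assumes every record costs the same to reduce and that each key goes entirely to one reducer, which is the classic behavior:

```python
# Back-of-the-envelope sketch: how long do 3 reducers take when whole keys
# are assigned to single reducers, using the skewed counts from the question?
# "1 time unit per record" is an assumption for illustration only.

records_per_key = {
    "key1": 99_000_000,
    "key2": 250_000,
    "key3": 250_000,
    "key4": 250_000,
    "key5": 250_000,
}

num_reducers = 3

# Classic MapReduce: every record of a key goes to exactly one reducer;
# here keys are handed out greedily to the currently least-loaded reducer.
loads = [0] * num_reducers
for key, count in sorted(records_per_key.items(), key=lambda kv: -kv[1]):
    target = loads.index(min(loads))
    loads[target] += count

total = sum(records_per_key.values())
wall_clock = max(loads)        # the job finishes when the slowest reducer finishes
ideal = total / num_reducers   # perfectly balanced load

print(f"reducer loads:          {loads}")
print(f"wall clock:             {wall_clock:,} record-times")
print(f"ideal balanced load:    {ideal:,.0f} record-times")
print(f"speedup over 1 reducer: {total / wall_clock:.2f}x (vs {num_reducers}x ideal)")
```

With these numbers the reducer holding 'key1' dominates: the job takes about 99 million record-times instead of the ideal ~33 million, so the three reducers give barely any speedup over one.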
In reality a lot of problems do satisfy the four points above (and of course a lot don't, which is why MapReduce isn't a universal solution). Common examples include problems with a large number of input records, each needing separate and computationally expensive processing, while the total input size stays small. E.g.
Determining whether a line intersects a 3D structure, where you may have a huge number of triangle faces and run an intersection test against each triangle (see the sketch below)
Pricing a large number of financial products
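A minimal sketch of the first example (plain Python with a process pool standing in for the workers; the ray/triangle test and the toy geometry are illustrative assumptions, not code from any MapReduce framework). Each per-triangle test is an independent "map" task, and combining the boolean results is a cheap "reduce":

```python
from functools import reduce
from multiprocessing import Pool

EPS = 1e-9

def sub(a, b):   return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def cross(a, b): return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def dot(a, b):   return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def ray_hits_triangle(args):
    """Moller-Trumbore ray/triangle intersection; one independent map task."""
    (origin, direction), (v0, v1, v2) = args
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < EPS:                      # ray parallel to the triangle's plane
        return False
    inv = 1.0 / det
    t_vec = sub(origin, v0)
    u = dot(t_vec, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    return dot(e2, q) * inv >= EPS          # intersection lies in front of the origin

if __name__ == "__main__":
    ray = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))          # toy ray along +z
    triangles = [((-1.0, -1.0, 5.0), (1.0, -1.0, 5.0), (0.0, 1.0, 5.0)),
                 ((10.0, 10.0, 5.0), (11.0, 10.0, 5.0), (10.0, 11.0, 5.0))]
    with Pool() as pool:                               # "map": independent per-triangle tests
        hits = pool.map(ray_hits_triangle, [(ray, tri) for tri in triangles])
    any_hit = reduce(lambda a, b: a or b, hits, False) # "reduce": combine the booleans
    print(any_hit)                                     # True: the first triangle is hit
```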
Hope the above helps.
Input data with the same key doesn't necessarily all have to be handled by one reducer; several reducers can share the records for a single key.
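Assuming the reduce function is an associative aggregation (e.g. a count), one way this can work in practice is to split the hot key into sub-keys so its records spread over reducers, then merge the partial results in a much smaller second pass. A minimal sketch in plain Python (not a real MapReduce framework; the key names and counts are illustrative):

```python
import random
from collections import defaultdict

NUM_REDUCERS = 3

def salt(key, num_shards=NUM_REDUCERS):
    """Split a hot key into sub-keys so its records spread over reducers."""
    return f"{key}#{random.randrange(num_shards)}"

# Toy map output: one skewed key plus a couple of small ones, each record counted as 1.
map_output = [(salt(k) if k == "key1" else k, 1)
              for k in ["key1"] * 9_000 + ["key2"] * 25 + ["key3"] * 25]

# First reduce pass: each (sub-)key is handled by exactly one reducer.
partition = defaultdict(list)
for key, value in map_output:
    partition[key].append(value)
first_pass = {key: sum(values) for key, values in partition.items()}

# Second, much smaller reduce pass: merge the partial counts of each salted key.
final = defaultdict(int)
for key, count in first_pass.items():
    final[key.split("#")[0]] += count

print(dict(final))   # e.g. {'key1': 9000, 'key2': 25, 'key3': 25}
```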
Take merge sort as an example. The map step divides an array into several sub-arrays, and multiple layers of reduce jobs sort and merge those sub-arrays back into one array. No matter how the data is arranged, the complexity is still O(n log n); for merge sort the best, worst, and average cases are the same, because the way the algorithm divides and merges the array is not affected by how the data is arranged.
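A minimal sketch of that picture (plain Python; function names are illustrative): a "map" phase splits and sorts chunks, then repeated "reduce" rounds merge pairs of sorted chunks until one sorted array remains.

```python
from heapq import merge

def map_phase(data, num_chunks):
    """Split the input into num_chunks sub-arrays and sort each one."""
    size = max(1, (len(data) + num_chunks - 1) // num_chunks)
    return [sorted(data[i:i + size]) for i in range(0, len(data), size)]

def reduce_round(chunks):
    """One layer of reduce: merge neighbouring pairs of sorted chunks."""
    return [list(merge(chunks[i], chunks[i + 1])) if i + 1 < len(chunks) else chunks[i]
            for i in range(0, len(chunks), 2)]

def merge_sort_mapreduce(data, num_chunks=4):
    chunks = map_phase(data, num_chunks)
    while len(chunks) > 1:          # multiple layers of reduce jobs
        chunks = reduce_round(chunks)
    return chunks[0] if chunks else []

print(merge_sort_mapreduce([5, 3, 9, 1, 4, 8, 2, 7, 6, 0]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```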