
I'm reading the MapReduce paper, and it mentions sorting the intermediate keys so that all occurrences of the same key are grouped together:

When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used
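The grouping step the paper describes can be sketched in a few lines. This is an illustrative in-memory version (the sample pairs and `reduce_fn` are my own, not from the paper); when the data doesn't fit in memory, the same idea is applied with an external sort instead of an in-memory one.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical intermediate (key, value) pairs read by one reduce worker
# from several map tasks; many distinct keys map to this one reduce task.
intermediate = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]

# Sort by key so all occurrences of the same key become adjacent...
intermediate.sort(key=itemgetter(0))

# ...then each run of equal keys is handed to the user's reduce function
# in a single call (here: a simple per-key sum, as in word count).
def reduce_fn(key, values):
    return key, sum(values)

results = [reduce_fn(k, [v for _, v in grp])
           for k, grp in groupby(intermediate, key=itemgetter(0))]

print(results)  # [('a', 2), ('b', 2), ('c', 1)]
```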

Then there is mention of the same reduce task being executed on multiple machines:

When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file.
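The write-temp-then-rename commit protocol the paper relies on can be sketched like this. The function name, the `part-NNNNN` naming, and the demo data are my own illustration; the point is that each attempt writes to its own unique temporary file, so even if two attempts of the same task both finish, both renames target the same final path and the file system guarantees the final file holds the complete output of exactly one attempt.

```python
import os
import tempfile

def commit_reduce_output(output_dir, task_id, data):
    # Each attempt writes to its own unique temporary file first...
    fd, tmp_path = tempfile.mkstemp(dir=output_dir)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # ...then atomically renames it to the single final name for this task.
    # os.replace is atomic on POSIX file systems, so a duplicate attempt's
    # rename simply overwrites the file with an identical complete copy.
    final_path = os.path.join(output_dir, f"part-{task_id:05d}")
    os.replace(tmp_path, final_path)
    return final_path
```

Calling it twice for the same task (as happens when a speculative duplicate also completes) still leaves exactly one final file:

```python
out = tempfile.mkdtemp()
commit_reduce_output(out, 0, "a\t2\n")
commit_reduce_output(out, 0, "a\t2\n")  # duplicate attempt, same final file
```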

If the same keys are grouped together, won't that become one reduce task to be run by one reduce worker? How can the same reduce task be run on multiple machines?


2 Answers


If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file.

It's possible due to speculative execution.

If a particular map or reduce task is taking a long time, the Hadoop framework starts the same task on a different machine, speculating that the long-running task has some issue. The slowness of a long-running task can be caused by network failures, a busy machine, or faulty hardware.

You can find more details about this concept in this SE question:

Hadoop speculative task execution

From the Apache documentation page, under Task Side-Effect Files:

There could be issues with two instances of the same Mapper or Reducer running simultaneously (for example, speculative tasks) trying to open and/or write to the same file (path) on the FileSystem. Hence the application-writer will have to pick unique names per task-attempt (using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per task.

To avoid these issues the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory accessible via ${mapreduce.task.output.dir} for each task-attempt on the FileSystem where the output of the task-attempt is stored.
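A simplified sketch of what that committer pattern looks like: each task attempt writes under its own `_temporary/<attempt_id>` sub-directory, so concurrent attempts of the same task never touch each other's files, and only the attempt that commits gets promoted into the job's output directory. The function names and directory layout here are illustrative, not the framework's exact implementation.

```python
import os
import shutil

def attempt_dir(output_dir, attempt_id):
    # Private scratch space for one task attempt, keyed by its attempt id
    # (e.g. "attempt_200709221812_0001_m_000000_0"), not just its task id.
    d = os.path.join(output_dir, "_temporary", attempt_id)
    os.makedirs(d, exist_ok=True)
    return d

def commit_attempt(output_dir, attempt_id, filename):
    # Promote the committing attempt's file into the job output directory
    # with an atomic rename, then discard that attempt's scratch space.
    src = os.path.join(attempt_dir(output_dir, attempt_id), filename)
    dst = os.path.join(output_dir, filename)
    os.replace(src, dst)
    shutil.rmtree(os.path.join(output_dir, "_temporary", attempt_id))
```

Two attempts of the same task can then write `part-00000` concurrently without conflict, because each writes inside its own attempt directory; whichever attempt commits first wins.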


I think you took it wrong. It means that if a single reduce task is large enough, then instead of being processed on a single machine it is processed on multiple machines; the output file from each machine is then renamed, aggregated, and presented as a single output file.

Multiple reduce tasks can also run on the same node. It depends on the speed of that node: if it finishes its reduce task faster than the other nodes, it is fed another reduce task.

For more information, refer to https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html There is a topic in that doc, "How Many Reduces?", which should answer your question.

I hope this resolves your query.