1 vote

In Hadoop, suppose the number of nodes is fixed (no server crashes during the run). If I use the same partitioner (e.g., hash partitioning on the key of the map output) to partition the mapper output, and I execute the job to read the same data set twice, is it guaranteed that data with the same key will go to the same reducer? Thanks.

For example, my map output consists of two rows: Key | value

A | anything

B | anything

Suppose I have two reducers, 1 and 2. In the first run, the line "A|anything" goes to reducer 1, and "B|anything" goes to reducer 2. If I run the job again, is it possible that "A|anything" goes to reducer 2 and "B|anything" goes to reducer 1?

Thanks!

1
Just curious - how does it matter? - Praveen Sripati
Because, in my case, the new data has to join the old data on the reduce side. If the old data in HDFS is read back, goes through the mapper, and is partitioned by the same key, then the join happens on the local machine, which won't be very expensive since it doesn't have to read the data over the network from other datanodes. - afancy

1 Answer

2
votes

There is no affinity between map/reduce tasks and nodes. When a map/reduce task is due to run, the scheduler picks whichever free map/reduce slot is available, which may or may not be on the same machine as in the previous run. So when a job is re-run, the same key may be processed by a reduce task running on a different node; that lack of affinity is part of what makes the Hadoop framework fault tolerant. Note, however, that with the same partitioner and the same number of reducers, a given key always maps to the same partition number: only the physical placement of the reduce task can change between runs.
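To illustrate why the partition number is stable across runs, here is a minimal sketch of the logic used by Hadoop's default `HashPartitioner`: the partition depends only on the key's `hashCode()` and the number of reduce tasks, both of which are the same on a re-run. The class name `PartitionDemo` is just for this example.

```java
// Sketch of how Hadoop's default HashPartitioner assigns a key to a
// reduce partition. Because String.hashCode() is specified by the Java
// language and is stable across JVM runs, the mapping from key to
// partition number is deterministic for a fixed number of reducers.
public class PartitionDemo {

    // Mirrors HashPartitioner.getPartition(): mask off the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;
        // "A".hashCode() == 65, so A always lands in partition 65 % 2 = 1;
        // "B".hashCode() == 66, so B always lands in partition 66 % 2 = 0.
        System.out.println("A -> reducer " + getPartition("A", reducers));
        System.out.println("B -> reducer " + getPartition("B", reducers));
    }
}
```

Which physical node executes the reduce task for partition 1 can differ between runs, but the key "A" will be routed to partition 1 every time.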