Hadoop - How does reducer gets it data?

Question

I understand that the mapper produces 1 partition per reducer. How does the reducer know which partition to copy? Lets say there are 2 nodes running mapper for word count program and there are 2 reducers configured. If each map node produces 2 partitions, with the possibility of partitions in both the nodes containing same word as key, how will the reducer work correctly?

For ex:

If node 1 produces partition 1 and partition 2, and partition 1 contains a key named "WHO".

If node 2 produces partition 3 and partition 4, and partition 3 contains a key named "WHO".

If Partition 1 and Partition 4 went to reducer 1 (and remaining to reducer 2), how does the reducer 1 compute the correct word count?

If this is not a possibility, and partition 1 and 3 would be made to go to reducer 1, how Hadoop does this? Does it make sure a given key-value pair from different nodes always go to a same reducer? If so, how it does this?

Thanks, Suresh.

yjshen yjshen · Accepted Answer · 2012-05-10T05:41:33

In your situation, since partition 1 and partition 3 both with the key 'WHO', it is guaranteed that the two partitions went to the same reducer.

Update

In hadoop, the max number of reduce tasks one a tasktracker at any one time is determined by the mapred.tasktracker.reduce.tasks.maximum property.
And the number of reducers for a MapReduce job is set via -D mapred.reduce.tasks=n

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well.(Hadoop: The definitive guide)

So, the value with a specified key would always go to the same reducer.

Hadoop - How does reducer gets it data?

1 Answers

Update