Eliminating duplicate key/value pairs from mappers in Hadoop

Question

If two mappers running on two different DataNodes emit the same key/value pair, and I am using a single reducer, how can I eliminate the duplicate pair before it reaches the reducer?

Should I use a combiner, check whether there are duplicate values for the same key, and eliminate them there? But a combiner only takes the key/value pairs from a single mapper as input, right?

David Gruzman · Accepted Answer · 2012-07-20T09:57:59

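Handling such duplicates is exactly the reducer's job. I don't think Hadoop offers a way to remove them before the reduce phase, because the reducer is the first place where your code sees the output of different mappers together.
As you correctly pointed out, a combiner will not fully solve this. It only reduces the number of duplicates, since it works on one mapper's output at a time.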
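
To make the difference concrete, here is a minimal plain-Java sketch of the two stages. It does not use the Hadoop API: `DedupSketch`, `combine`, and `reduce` are hypothetical names, and the lists of strings simply stand in for the values emitted for one key. It only illustrates why deduplication has to finish in the reducer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of the two stages; no Hadoop classes are used.
class DedupSketch {

    // Stand-in for a combiner: it sees the values of ONE mapper for a key,
    // so it can only drop duplicates within that mapper's output.
    static List<String> combine(List<String> valuesFromOneMapper) {
        return new ArrayList<>(new LinkedHashSet<>(valuesFromOneMapper));
    }

    // Stand-in for the reducer: it sees the values for a key from ALL
    // mappers, so duplicates across mappers can be removed here.
    static List<String> reduce(List<List<String>> valuesPerMapper) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> mapperOutput : valuesPerMapper) {
            seen.addAll(mapperOutput);
        }
        return new ArrayList<>(seen);
    }
}
```

For example, if mapper A emits the values `x, x, y` and mapper B emits `x, z` for the same key, `combine` shrinks A's output to `x, y` but cannot know that B also emitted `x`. Only `reduce`, which receives both lists, can return `x, y, z`. In a real job the same set-based check would live inside the reducer, and this version assumes the distinct values for a key fit in memory.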
