2 votes

I am trying to run a Hadoop job on a very large amount of data, using up to 32 reducers. But when I look at the output of each reducer, I see that more than one reducer can receive the same key (with different values, of course). Can this behavior be avoided while still using more than one reducer?

Later edit: I've tried using the Text class instead, and although it works fine, my JVM eventually crashes because it runs low on heap space. What criteria does Hadoop use to partition data into key groups, apart from compareTo()?


2 Answers

7 votes

You say you have a custom key (one that implements WritableComparable). Have you overridden the hashCode() method?

If you're using the HashPartitioner (the default) and haven't overridden hashCode() in your custom key, then two identical keys emitted from different mappers will most probably go to different reducers (the result of hashCode() is taken modulo the number of reducers to determine which reducer the key/value pair is sent to). This is because, by default, hashCode() is a native method that returns a value derived from the object's memory address, so two logically equal key objects do not hash to the same value.
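A hash-based partitioner's choice of reducer boils down to something like the following sketch (this mirrors what the default HashPartitioner does; the type parameter names are illustrative):

    // Sketch: mask the key's hash to a non-negative value, then take it
    // modulo the number of reduce tasks to pick the target reducer.
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

So if hashCode() returns a different value for each object instance, identical keys produced on different nodes end up in different partitions.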

A hashCode() implementation for your key could be as simple as adding together the hash codes of its fields (assuming those fields provide proper hashCode() implementations themselves):

    @Override
    public int hashCode() {
        // Combine the hash codes of the key's fields
        return field1.hashCode() + field2.hashCode();
    }
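If it helps, here is a minimal sketch of what a complete custom key could look like; the class and field names are made up for illustration, not taken from your code:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical two-field key. The important part is that hashCode()
    // and equals() depend only on the field values, so equal keys hash
    // identically no matter which mapper emitted them.
    public class MyCompositeKey implements WritableComparable<MyCompositeKey> {
        private Text field1 = new Text();
        private Text field2 = new Text();

        @Override
        public void write(DataOutput out) throws IOException {
            field1.write(out);
            field2.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            field1.readFields(in);
            field2.readFields(in);
        }

        @Override
        public int compareTo(MyCompositeKey other) {
            int cmp = field1.compareTo(other.field1);
            return (cmp != 0) ? cmp : field2.compareTo(other.field2);
        }

        @Override
        public int hashCode() {
            // A prime multiplier spreads keys across reducers more evenly
            // than a plain sum, but either works for correctness.
            return field1.hashCode() * 163 + field2.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof MyCompositeKey)) return false;
            MyCompositeKey other = (MyCompositeKey) o;
            return field1.equals(other.field1) && field2.equals(other.field2);
        }
    }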
5 votes

I suspect that what you are seeing is speculative execution. Normally, all values for a given key go to exactly one reducer. From http://developer.yahoo.com/hadoop/tutorial/module4.html:

Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.

Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
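As a sketch, setting those options with the old-style JobConf API they belong to would look something like this (the driver class name is illustrative):

    // Using org.apache.hadoop.mapred.JobConf
    JobConf conf = new JobConf(MyDriver.class);
    // Turn off speculative execution for map and reduce tasks
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);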