I am new to Hadoop. I am working on a MapReduce program that uses Avro. The program's logic is correct when run against local Hadoop (1 reducer), but against an 8-node CDH cluster only one of the 64 reducers actually does any work; the logs of the other 63 reducers show that they received no data from the mappers.
The data processing itself is not complicated, actually very simple. Below are the Mapper and Reducer signatures:
public static class MyAvroMap extends Mapper<AvroKey<NetflowRecord>, NullWritable,
                                             Text, AvroValue<NetflowRecord>> {}

public static class MyAvroReduce extends Reducer<Text, AvroValue<NetflowRecord>,
                                                 AvroKey<NetflowRecord>, NullWritable> {}
The map output key is derived from a string field of NetflowRecord. Is there a problem with my choice of shuffle key, or anything else Avro-related? Thanks in advance.
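Here is a minimal sketch of what my map() does (getSrcAddr() is just a placeholder for the real string-field accessor on NetflowRecord):

import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class MyAvroMap extends Mapper<AvroKey<NetflowRecord>, NullWritable,
                                             Text, AvroValue<NetflowRecord>> {
    @Override
    protected void map(AvroKey<NetflowRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        NetflowRecord record = key.datum();
        // The shuffle key is one string field of the record;
        // getSrcAddr() stands in for the actual accessor.
        context.write(new Text(record.getSrcAddr().toString()),
                      new AvroValue<NetflowRecord>(record));
    }
}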
UPDATE: In the experiment above I used a 7GB Avro file, and only one reducer worked. When I increased the input volume to hundreds of GB, the other reducers started working as well. As far as I know, Hadoop by default splits input files at 64MB. Why does it behave differently with Avro data?
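My (possibly wrong) understanding is that the reducer a record goes to depends only on the key's hash under the default HashPartitioner, not on input size, so I sanity-checked how a few sample keys would spread over 64 partitions (the keys below are made up):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        // Default behavior: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        HashPartitioner<Text, Object> partitioner = new HashPartitioner<Text, Object>();
        for (String k : new String[] {"10.0.0.1", "10.0.0.2", "192.168.1.5"}) {
            System.out.println(k + " -> reducer "
                    + partitioner.getPartition(new Text(k), null, 64));
        }
    }
}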
BTW: we did not change CDH's default file-split parameters, if it has any.
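For reference, this is how I believe the split size could be overridden in the driver if we ever needed to (we have not done this; the 64MB value is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Example only: we did NOT set these; the cluster runs with CDH defaults.
Configuration conf = new Configuration();
Job job = new Job(conf, "netflow-avro");
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64MB
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);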
Jamin