I am new to Hadoop. I am working on a MapReduce program that uses Avro. The program's logic is correct when run against local Hadoop (1 reducer), but against an 8-node CDH cluster only one of the 64 reducers actually does any work; the logs of the other 63 reducers show that they received no data from the mappers.
The data processing itself is not complicated, actually very simple. Below are the Mapper and Reducer signatures:
public static class MyAvroMap extends Mapper<AvroKey<NetflowRecord>, NullWritable,
                                             Text, AvroValue<NetflowRecord>> {}

public static class MyAvroReduce extends Reducer<Text, AvroValue<NetflowRecord>,
                                                 AvroKey<NetflowRecord>, NullWritable> {}
The map output key is derived from a string field of NetflowRecord. Is there a problem with my choice of shuffle key, or anything else Avro-related? Thanks in advance.
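Here is a minimal sketch of what my map() does (getSrcAddr() is just a placeholder for the real string-field accessor on NetflowRecord):

import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class MyAvroMap extends Mapper<AvroKey<NetflowRecord>, NullWritable,
                                             Text, AvroValue<NetflowRecord>> {
    @Override
    protected void map(AvroKey<NetflowRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        NetflowRecord record = key.datum();
        // The shuffle key is one string field of the record;
        // getSrcAddr() stands in for the actual accessor.
        context.write(new Text(record.getSrcAddr().toString()),
                      new AvroValue<NetflowRecord>(record));
    }
}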
UPDATE: In the experiment above I used a 7GB Avro file, and only one reducer worked. When I increased the input volume to hundreds of GB, the other reducers started working as well. As far as I know, Hadoop by default splits input files at 64MB. Why does it behave differently with Avro data?
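My (possibly wrong) understanding is that the reducer a record goes to depends only on the key's hash under the default HashPartitioner, not on input size, so I sanity-checked how a few sample keys would spread over 64 partitions (the keys below are made up):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        // Default behavior: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        HashPartitioner<Text, Object> partitioner = new HashPartitioner<Text, Object>();
        for (String k : new String[] {"10.0.0.1", "10.0.0.2", "192.168.1.5"}) {
            System.out.println(k + " -> reducer "
                    + partitioner.getPartition(new Text(k), null, 64));
        }
    }
}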
BTW: we did not change CDH's default file-split parameters, if it has any.
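For reference, this is how I believe the split size could be overridden in the driver if we ever needed to (we have not done this; the 64MB value is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Example only: we did NOT set these; the cluster runs with CDH defaults.
Configuration conf = new Configuration();
Job job = new Job(conf, "netflow-avro");
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64MB
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);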
Jamin