I am writing a Hadoop MapReduce job that uses two Scans over a single HBase table to feed my mappers. The table has 10 regions. I create the two Scan objects, call setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, tableName) on each of them, and then do this:
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
TableMapReduceUtil.initTableMapperJob(scans, FaultyRegisterReadMapper.class, MeterTimeKey.class, ReadValueTime.class, job);
For some reason, only two mappers are created most of the time. I would like there to be more mappers but that's not really a big deal.
The really bad part is that SOMETIMES it creates three mappers, and when it does, the first two finish quite quickly but the third doesn't even start for five minutes. That slow-starting mapper is what is really bothering me. :)
This is on a cluster with some 60 nodes and it is not busy.
I suspect the number of mappers is driven by how much data the scans find in the table, but I'm not sure of that.
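One alternative to the data-volume guess is worth ruling out. As far as I understand it (this is an assumption on my part, not something I have confirmed for my HBase version), the table input formats create one split, and so one mapper, for every region whose key range overlaps a scan's start/stop rows. The sketch below is a simplified, HBase-free model of that rule. The class name, the region boundaries, and the row keys are all made up for illustration and are not taken from my job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCountSketch {

    /** Half-open row-key range [start, stop); an empty string means "unbounded". */
    public static final class Range {
        final String start;
        final String stop;

        public Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** True when the two half-open ranges share at least one possible row key. */
    public static boolean overlaps(Range a, Range b) {
        boolean aStartsBeforeBEnds = b.stop.isEmpty() || a.start.compareTo(b.stop) < 0;
        boolean bStartsBeforeAEnds = a.stop.isEmpty() || b.start.compareTo(a.stop) < 0;
        return aStartsBeforeBEnds && bStartsBeforeAEnds;
    }

    /** Assumed rule: one split (one mapper) per (scan, overlapping region) pair. */
    public static int countSplits(List<Range> regions, List<Range> scans) {
        int splits = 0;
        for (Range scan : scans) {
            for (Range region : regions) {
                if (overlaps(region, scan)) {
                    splits++;
                }
            }
        }
        return splits;
    }

    /** Hypothetical table with 10 regions split at the keys "1" through "9". */
    public static List<Range> tenRegions() {
        List<Range> regions = new ArrayList<>();
        String previous = "";
        for (int i = 1; i <= 9; i++) {
            String boundary = Integer.toString(i);
            regions.add(new Range(previous, boundary));
            previous = boundary;
        }
        regions.add(new Range(previous, "")); // last region is open-ended
        return regions;
    }

    public static void main(String[] args) {
        List<Range> regions = tenRegions();

        // Two narrow scans, each inside a single region.
        List<Range> narrow = Arrays.asList(new Range("30", "35"), new Range("70", "75"));
        System.out.println("narrow scans: " + countSplits(regions, narrow));

        // Same scans, but the first one now crosses a region boundary.
        List<Range> crossing = Arrays.asList(new Range("30", "45"), new Range("70", "75"));
        System.out.println("one scan crossing a boundary: " + countSplits(regions, crossing));
    }
}
```

Under that assumed rule, two narrow scans give two mappers, and a third appears only when one scan's key range happens to straddle a region boundary. That would match the "two mappers most of the time, three sometimes" pattern without depending on how much data is in the table, but it does not by itself explain the five-minute delay.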
Main question: Any ideas why one mapper takes so long to start?