How to run mapreduce on hbase scanner result with TableMapReduceUtil

Question

My HBase table looks like this:

    key---------value
    id1/bla     value1
    id1/blabla  value2
    id2/bla     value3
    id2/blabla  value4
    ....

There are millions of keys that start with id1 and millions of keys that start with id2.

I want to read the data from HBase with MapReduce. Because so many keys start with the same id, one mapper per id isn't enough; I would prefer around 100 mappers per id.

I want more than one mapper to run over the same scanner result, filtered by id. I read about TableMapReduceUtil and tried the following:

    Configuration config = HBaseConfiguration.create();
    Job job = Job.getInstance(config, "ExampleSummary");  // replaces the deprecated new Job(config, name)
    job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer

    Scan scan = new Scan();
    scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false);  // don't set to true for MR jobs
    // set other scan attrs

    TableMapReduceUtil.initTableMapperJob(
        sourceTable,        // input table
        scan,               // Scan instance to control CF and attribute selection
        MyMapper.class,     // mapper class
        Text.class,         // mapper output key
        IntWritable.class,  // mapper output value
        job);

The map function would look like this (it should iterate over the scanner result):

    public static class MyMapper extends TableMapper<Text, IntWritable> {

        private final IntWritable ONE = new IntWritable(1);
        private final Text text = new Text();

        // Called once per row: 'row' is the row key, 'value' holds that row's cells.
        @Override
        public void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            text.set("123");     // we can only emit Writables...
            context.write(text, ONE);
        }
    }

My questions are:

1. How is it possible that the map function receives a Result as input and not a ResultScanner? I know that the result of a scan is iterated through a ResultScanner, which in turn yields Result objects. Doesn't a ResultScanner hold a list or array of Results?
2. How can I iterate over the scanner's results in the map function?
3. How can I control the number of splits this job creates? If it opens only 10 mappers and I want 20, is it possible to change something?
4. Is there a simpler way to achieve my goal?

Costi Ciudatu · Accepted Answer · 2016-08-21T18:14:12

I'll start with #4 in your list:

The default behavior is to create one mapper per region. Therefore, instead of trying to hack the TableInputFormat into creating custom input splits based on your specifications, you should first consider splitting your data into 100 regions (and then you'll have 100 mappers pretty well balanced).
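
As a rough illustration of why the number of regions drives the number of mappers, the sketch below counts how many row keys land in each region for a given set of split points. It is plain Java and does not touch HBase: `RegionAssignmentSketch` and `rowsPerRegion` are hypothetical names, and it assumes ASCII row keys so that `String` ordering agrees with HBase's byte-wise key ordering.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; it is not part of the HBase API.
final class RegionAssignmentSketch {

    // Counts how many of the given row keys fall into each region, where a
    // region is identified by its start key. The first region always starts
    // at the empty key. Plain String ordering is used here, which matches
    // HBase's byte ordering only for ASCII keys (an assumption of this sketch).
    static Map<String, Integer> rowsPerRegion(List<String> splitKeys, List<String> rowKeys) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("", 0);
        for (String split : splitKeys) {
            counts.put(split, 0);
        }
        for (String row : rowKeys) {
            // The owning region is the one with the greatest start key <= row.
            counts.merge(counts.floorKey(row), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("id1/bla", "id1/blabla", "id2/bla", "id2/blabla");
        // One split point: two regions, so two mappers by default.
        System.out.println(rowsPerRegion(Arrays.asList("id2"), rows));
        // Three split points: four regions, so four mappers by default.
        System.out.println(rowsPerRegion(Arrays.asList("id1/blabla", "id2", "id2/blabla"), rows));
    }
}
```

With a single split at `id2`, every `id1` key ends up in one region and therefore in one mapper; adding split points inside the `id1/` and `id2/` ranges spreads the same rows over more regions.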

This approach improves both your read and write performance, as you'll be less vulnerable to hotspotting (assuming that you have more than one or two region servers in your cluster).

The preferred way to go about this is to pre-split your table (i.e. define the splits on table creation).
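
One possible way to choose the split points, sketched under assumptions: take a sample of real row keys, sort it, and pick evenly spaced keys as region boundaries. The class and method names below (`SplitKeySketch`, `splitKeysFromSample`) are hypothetical, the sample is assumed to be representative of the full key distribution, and keys are assumed to be ASCII. The resulting `byte[][]` is the kind of argument accepted by the `Admin.createTable` overload that takes split keys; that call is only mentioned in a comment so the sketch runs without an HBase cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; it is not part of the HBase API.
final class SplitKeySketch {

    // Returns (numRegions - 1) split keys taken at evenly spaced positions of
    // the sorted, de-duplicated sample, so each region covers roughly the
    // same number of sampled keys.
    static byte[][] splitKeysFromSample(List<String> sampleRowKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(sampleRowKeys));
        if (numRegions < 2 || sorted.size() < numRegions) {
            throw new IllegalArgumentException("need at least numRegions distinct sample keys");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            int index = (int) ((long) i * sorted.size() / numRegions);
            splits[i - 1] = sorted.get(index).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "id1/a", "id1/b", "id1/c", "id1/d",
            "id2/a", "id2/b", "id2/c", "id2/d");
        byte[][] splits = splitKeysFromSample(sample, 4);
        for (byte[] split : splits) {
            System.out.println(new String(split, StandardCharsets.UTF_8));
        }
        // In a real job, 'splits' would be passed along with the table
        // descriptor when creating the table, e.g. admin.createTable(descriptor, splits).
    }
}
```

Sampling keeps the split points tied to the actual key distribution rather than to a guess about what follows the `id1/` prefix, which matters when the suffixes are not uniformly distributed.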
