My HBase table looks like this:
id1/bla value1
id1/blabla value2
id2/bla value3
id2/blabla value4
There are millions of keys that start with id1 and millions of keys that start with id2.
I want to read the data from HBase with MapReduce. Because so many keys start with the same id, one mapper per id isn't good enough; I would prefer about 100 mappers per id.
In other words, I want more than one mapper to run over the same scanner result that has been filtered by id.
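To make "filtered by id" concrete, here is a minimal sketch in plain Java (no HBase dependency) of the key range I have in mind for a single id. The names `PrefixRange`, `startRow`, `stopRow` and `inRange` are hypothetical and only for illustration. It assumes row keys are ordered as unsigned bytes, which is how HBase sorts them, and that the resulting pair would serve as the start row and exclusive stop row of a range-restricted scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: derives the [start, stop) row range covering one key prefix.
final class PrefixRange {

    // The first possible key with this prefix is the prefix itself.
    static byte[] startRow(byte[] prefix) {
        return prefix.clone();
    }

    // Exclusive upper bound: drop trailing 0xFF bytes, then increment the last byte.
    // An empty array means "no upper bound" (the prefix was empty or all 0xFF).
    static byte[] stopRow(byte[] prefix) {
        int i = prefix.length - 1;
        while (i >= 0 && prefix[i] == (byte) 0xFF) {
            i--;
        }
        if (i < 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, i + 1);
        stop[i]++;
        return stop;
    }

    // Lexicographic comparison that treats bytes as unsigned values.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // True if key falls inside the range derived from prefix.
    static boolean inRange(byte[] key, byte[] prefix) {
        byte[] stop = stopRow(prefix);
        return compare(key, startRow(prefix)) >= 0
                && (stop.length == 0 || compare(key, stop) < 0);
    }

    public static void main(String[] args) {
        byte[] prefix = "id1/".getBytes(StandardCharsets.UTF_8);
        // '/' is 0x2F, so the stop row ends in 0x30, which is '0'.
        System.out.println(new String(stopRow(prefix), StandardCharsets.UTF_8)); // prints id10
        System.out.println(inRange("id1/blabla".getBytes(StandardCharsets.UTF_8), prefix)); // prints true
        System.out.println(inRange("id2/bla".getBytes(StandardCharsets.UTF_8), prefix)); // prints false
    }
}
```

What I am after is a way to cut such a range for one id into many smaller pieces, each handled by its own mapper.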
I read about TableMapReduceUtil and tried the following:
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "ExampleSummary");
    job.setJarByClass(MySummaryJob.class);  // class that contains mapper and reducer

    Scan scan = new Scan();
    scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false);  // don't set to true for MR jobs
    // set other scan attrs

    TableMapReduceUtil.initTableMapperJob(
        sourceTable,        // input table
        scan,               // Scan instance to control CF and attribute selection
        MyMapper.class,     // mapper class
        Text.class,         // mapper output key
        IntWritable.class,  // mapper output value
        job);
The map function looks like this (it should iterate over the scanner result):
    public static class MyMapper extends TableMapper<Text, IntWritable> {
        private final IntWritable ONE = new IntWritable(1);
        private Text text = new Text();

        @Override
        public void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            text.set("123");  // we can only emit Writables...
            context.write(text, ONE);
        }
    }
My questions are:
- How is it possible that the map function gets a Result as input and not a ResultScanner? I know that the result of a scan is iterated through a ResultScanner, which in turn yields Result objects. A ResultScanner holds a list or array of Results, doesn't it?
- How can I iterate over the scanner's results in the map function?
- How can I control the number of splits the job creates? If it opens only 10 mappers and I want 20, is it possible to change something?
- Is there a simpler way to achieve my goal?