HBase mapreduce job - Multiple scans - How to set the table of each Scan

Question

I use HBase 1.2. I would like to run a MapReduce job on HBase using multiple scans. In the API, there is : TableMapReduceUtil.initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job).

But how to specify the table of each scan ? I use the code below :

List<Scan> scans = new ArrayList<>();
for (String firstPart : firstParts) {
    Scan scan = new Scan();
    scan.setRowPrefixFilter(Bytes.toBytes(firstPart));
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    scans.add(scan);
}
TableMapReduceUtil.initTableMapperJob(scans, MyMapper.class, Text.class, Text.class, job);

It gives the following Exception

Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:436)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:184)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:241)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:240)
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1325)

I think it's normal since the tables on which each scan should be applied are not specified anywhere.

But how to do it ?

I tried to add

scan.setAttribute("scan.attributes.table.name", Bytes.toBytes("my_table"));

but it gives the same error

Ram Ghadiyaram Ram Ghadiyaram · Accepted Answer · 2017-03-15T13:36:11

From docs https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat.html

List<Scan> scans = new ArrayList<Scan>();

 Scan scan1 = new Scan();
 scan1.setStartRow(firstRow1);
 scan1.setStopRow(lastRow1);
 scan1.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, table1);
 scans.add(scan1);

 Scan scan2 = new Scan();
 scan2.setStartRow(firstRow2);
 scan2.setStopRow(lastRow2);
 scan1.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, table2);
 scans.add(scan2);

 TableMapReduceUtil.initTableMapperJob(scans, TableMapper.class, Text.class,
     IntWritable.class, job);

Your Case :

use Scan.SCAN_ATTRIBUTES_TABLE_NAME since you are not setting table at scan instance level you are getting this NPE...

please follow this example where you have to set the table name inside your for loop not outside ... then it should work

List<Scan> scans = new ArrayList<Scan>();

  for(int i=0; i<3; i++){
    Scan scan = new Scan();

    scan.addFamily(INPUT_FAMILY);
    scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(TABLE_NAME ));

    if (start != null) {
      scan.setStartRow(Bytes.toBytes(start));
    }
    if (stop != null) {
      scan.setStopRow(Bytes.toBytes(stop));
    }

    scans.add(scan);

    LOG.info("scan before: " + scan);

HBase mapreduce job - Multiple scans - How to set the table of each Scan

1 Answers

From docs https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat.html

Your Case :