Here's the sample code from the HBase Book showing how to run a MapReduce job that reads from an HBase table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
    tableName,      // input HBase table name
    scan,           // Scan instance to control CF and attribute selection
    MyMapper.class, // mapper
    null,           // mapper output key
    null,           // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
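The MyMapper class referenced above isn't shown in the snippet. A minimal sketch of a read-only mapper that would pair with this job could look like the following (the class body and column names are illustrative only; since the job uses NullOutputFormat, the mapper never writes to the context):

public static class MyMapper extends TableMapper<Text, Text> {

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // process the row here, e.g. read a single cell (hypothetical family/qualifier):
        // byte[] cell = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
        // nothing is written to the context, which matches NullOutputFormat above
    }
}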
When you say "value for the scan", that's not a real thing. You either mean scan.setCaching() or scan.setBatch() or scan.setMaxResultSize().
setCaching tells the region server how many rows to fetch per RPC before returning them to the client.
setBatch limits the number of columns returned in each call, which matters if you have a very wide table.
setMaxResultSize limits the size, in bytes, of the data returned to the client in each call.
Typically you don't set MaxResultSize in a MapReduce job, so you will see all of the data.
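For reference, here's how those three settings look on a Scan (the numbers are illustrative only; tune them for your table and cluster):

Scan scan = new Scan();
scan.setCaching(500);                    // rows fetched per RPC from the region server
scan.setBatch(100);                      // max columns returned per Result, for very wide rows
scan.setMaxResultSize(2 * 1024 * 1024);  // max bytes returned per call (rarely set for MR jobs)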
Reference for the above information is here.