Here's the sample code from the HBase Book showing how to run a MapReduce job that reads from an HBase table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't fill the region servers' block cache with a one-off scan
...

TableMapReduceUtil.initTableMapperJob(
    tableName,       // the table to read from
    scan,            // the Scan instance controlling what is read
    MyMapper.class,  // the mapper
    null,            // mapper output key class (none, since nothing is emitted)
    null,            // mapper output value class
    job);
job.setOutputFormatClass(NullOutputFormat.class);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
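Since the sample passes null for the mapper output key and value classes and uses NullOutputFormat, the mapper doesn't have to emit anything. The book sample doesn't show MyMapper, but a minimal sketch of it could look like this (the body is an assumption for illustration):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

public class MyMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        // called once per row; 'row' holds the cells the Scan returned for that row
        // e.g. byte[] v = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
    }
}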
When you say "value for the scan", that's not a real thing. You either mean scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().
setCaching tells the server how many rows to load and return to the client per RPC call.
setBatch limits the number of columns returned in each Result if you have a very wide table; a single row can then be split across several Results.
setMaxResultSize limits the size, in bytes, of the data returned to the client per call.
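Here's a minimal sketch showing how the three settings fit together on one Scan (the values are illustrative assumptions, not recommendations):

Scan scan = new Scan();
scan.setCaching(500);                    // server returns up to 500 rows per RPC
scan.setBatch(100);                      // each Result holds at most 100 cells of a wide row
scan.setMaxResultSize(2 * 1024 * 1024);  // cap each RPC's payload at roughly 2 MB

Note that once setBatch is in effect, a wide row may arrive as several partial Results, so code consuming the scanner (including a mapper) may see the same row key more than once.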
Typically you don't set MaxResultSize in a MapReduce job, so you will see all of the data.
Reference for the above information is here.