Here's the sample code from the HBase Book showing how to run a MapReduce job that reads from an HBase table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
    tableName,      // input HBase table name
    scan,           // Scan instance to control CF and attribute selection
    MyMapper.class, // mapper
    null,           // mapper output key
    null,           // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
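The MyMapper class referenced above isn't shown in the snippet. A minimal sketch of a read-only mapper that would pair with this job could look like the following (the class body and column names are illustrative only; since the job uses NullOutputFormat, the mapper never writes to the context):

public static class MyMapper extends TableMapper<Text, Text> {

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // process the row here, e.g. read a single cell (hypothetical family/qualifier):
        // byte[] cell = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
        // nothing is written to the context, which matches NullOutputFormat above
    }
}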
When you say "value for the scan", that's not a real thing. You either mean scan.setCaching() or scan.setBatch() or scan.setMaxResultSize().
setCaching tells the region server how many rows to fetch per RPC before returning them to the client.
setBatch limits the number of columns returned in each call, which matters if you have a very wide table.
setMaxResultSize limits the size, in bytes, of the data returned to the client in each call.
Typically you don't set MaxResultSize in a MapReduce job, so you will see all of the data.
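For reference, here's how those three settings look on a Scan (the numbers are illustrative only; tune them for your table and cluster):

Scan scan = new Scan();
scan.setCaching(500);                    // rows fetched per RPC from the region server
scan.setBatch(100);                      // max columns returned per Result, for very wide rows
scan.setMaxResultSize(2 * 1024 * 1024);  // max bytes returned per call (rarely set for MR jobs)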
Reference for the above information is here.