2
votes

Problem

I have some jobs that just need to iterate over every record in an HBase table and do some task. For example, extract a field for an export or update a field based on some new business rule.

Reasoning

MapReduce seems like overkill here. There's nothing to really map, and there's no "reducing" either: the map input is always just the key plus the record. There's certainly no use for shuffle and sort, since the keys are guaranteed to be unique coming out of HBase.

For performance reasons, this should still be distributed. I guess I'm looking for a good old-fashioned table scan that happens to be distributed.

Question

What options exist to leverage the cluster but avoid the unnecessary steps of a full MapReduce job?


2 Answers

2
votes

Coprocessors are for exactly this. From the link: "a framework for both flexible and generic extension, and of distributed computation directly within the HBase server processes".
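For illustration only, a region observer is one flavor of coprocessor: its hooks run inside each region server process, so the work is distributed across the cluster by construction. The class name below is hypothetical, and `BaseRegionObserver`/`preGetOp` are from the older (0.94/0.98-era) coprocessor API, which has changed in later HBase versions:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

// Hypothetical observer: loaded onto the region servers (via hbase-site.xml
// or a table attribute), so this code executes server-side, not on the client.
public class ExampleObserver extends BaseRegionObserver {
  @Override
  public void preGetOp(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Get get, List<Cell> results) throws IOException {
    // inspect or transform the read before the region serves it
  }
}
```

Endpoint coprocessors are the other flavor and are closer to "distributed computation over every row", but their API (protobuf-based in recent versions) is considerably more involved.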

1
votes

You can do a map-only job - it would do exactly what you want. To get a map-only job, use the TableMapReduceUtil.initTableMapperJob helper method and set the number of reducers to zero with job.setNumReduceTasks(0);
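A minimal sketch of such a map-only job follows. TableMapReduceUtil, TableMapper, and the job setup calls are from the standard HBase/Hadoop MapReduce integration; the table name "mytable" and the empty mapper body are placeholders for your own table and per-record work:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TableScanJob {

  // The mapper receives each row key + row; do the per-record task here.
  static class ScanMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      // e.g. extract a field for export, or issue a Put to update a column
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "table-scan");
    job.setJarByClass(TableScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches for scan throughput
    scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, ScanMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);    // map-only: no shuffle, no sort, no reduce
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With zero reducers, each map task scans the regions local to it and the job skips the shuffle/sort phase entirely, which is the distributed table scan the question asks for.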

Also, you can push some of the processing to HBase itself if you specify a filter for the scan, so that non-matching rows never leave the region servers.