Problem
I have some jobs that just need to iterate over every record in an HBase table and perform some task: for example, extracting a field for an export, or updating a field based on some new business rule.
Reasoning
MapReduce seems like overkill here. There is nothing to really map, and there is no "reducing" either; the map output is always just the key plus the record. There is certainly no use for shuffle and sort, since the keys coming from HBase are guaranteed to be unique.
For performance reasons, this should still be distributed. I guess I'm looking for a good old-fashioned table scan that happens to be distributed.
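For reference, the closest thing in stock Hadoop to what I describe above is a map-only job with zero reducers, which at least skips the shuffle and sort. A rough sketch of that baseline, assuming the standard HBase `TableMapReduceUtil` API (the table name "mytable" and the empty mapper body are placeholders for the real task):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanOnlyJob {

    // Each call receives one row: the key plus its Result. The per-record
    // task (extract a field, write an updated cell back, etc.) goes here.
    static class RowTaskMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx) {
            // placeholder: do something with rowKey / row
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-only");
        job.setJarByClass(ScanOnlyJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in batches per RPC
        scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan

        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, RowTaskMapper.class,   // "mytable" is a placeholder
            NullWritable.class, NullWritable.class, job);

        job.setNumReduceTasks(0);    // map-only: no shuffle, no sort, no reduce
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Mappers are launched with locality against the table's regions, so this does distribute the scan; the question is whether there is something lighter than spinning up the MapReduce machinery at all.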
Question
What options exist to leverage the cluster but avoid the unnecessary steps of a full MapReduce job?