I'm struggling to distribute my HBase rows across several map tasks in a sensible way. My aim is to split my scan by row key and hand each map task its own set of rows.
So far I have only managed to define a scan where each mapper always receives a single row at a time. That is not what I want; I need the map input set-wise.
So is there a way to split my HBase table, or rather the scan over it, into n sets of rows that then serve as input for n mappers?
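To illustrate what I mean by "n sets": conceptually I want something like the following sketch, which splits a key range into n contiguous sub-ranges, one per mapper. This is plain Java with no HBase dependencies; the numeric keys and the class/method names are just placeholders standing in for my real row keys and scan boundaries.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {
    // Split the half-open key range [start, end) into n contiguous
    // sub-ranges. Each sub-range would become the (startRow, stopRow)
    // pair of one mapper's scan.
    static List<long[]> split(long start, long end, int n) {
        List<long[]> ranges = new ArrayList<>();
        long span = end - start;
        for (int i = 0; i < n; i++) {
            long s = start + span * i / n;
            long e = start + span * (i + 1) / n;
            ranges.add(new long[]{s, e});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: 100 keys split across 4 mappers.
        for (long[] r : split(0, 100, 4)) {
            System.out.println(r[0] + " - " + r[1]);
        }
    }
}
```

Each mapper would then be responsible for scanning exactly one of those sub-ranges instead of being handed rows one by one.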
I am not looking for a workaround where one MapReduce job writes n files and a second MapReduce job reads them back in as text input to obtain these sets.
Thanks in advance!