1
votes

I'm struggling to distribute my HBase rows properly across several map tasks. My aim is to split my scan by row key and hand a set of rows to each map task.

So far I have only been able to define a scan where my mappers always get one row at a time. But that is not what I want - I need the map input as sets of rows.

So is there a way to split my HBase table, or rather the scan, into n sets of rows, which are then the input for n mappers?

I am not looking for a solution that runs one MapReduce job to write n files and then another MapReduce job to read them back as text input in order to get these sets.

Thanks in advance!


2 Answers

1
votes

Mappers will always get one row at a time - that's the way MapReduce works. If you want to relate multiple rows to each other on the map side, you can either do that yourself (e.g. using some static variables) or write the logic as a combiner, which is a map-side "reduce" step.
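The "do that yourself" option can be sketched without any Hadoop dependency. The class below is a hypothetical helper (RowBatcher and its method names are made up for illustration, not part of HBase or Hadoop): it assumes you would call add() from the mapper's map() method for every row and flush() from the mapper's cleanup() method, so that the rows reach your own logic in sets. It uses an instance field rather than a static variable as the buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects rows that arrive one at a time and hands them on in sets. */
final class RowBatcher {
    private final int batchSize;
    private final Consumer<List<String>> onBatch;
    private List<String> buffer = new ArrayList<>();

    RowBatcher(int batchSize, Consumer<List<String>> onBatch) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.onBatch = onBatch;
    }

    /** Intended to be called once per row, e.g. from the mapper's map() method. */
    void add(String rowKey) {
        buffer.add(rowKey);
        if (buffer.size() == batchSize) {
            flush();
        }
    }

    /** Intended to be called once at the end, e.g. from the mapper's cleanup() method. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = buffer;
        buffer = new ArrayList<>(); // start a fresh buffer so the emitted batch is not modified later
        onBatch.accept(batch);
    }
}
```

The last set of a split may be smaller than batchSize, and rows that belong together can still end up in different mappers at split boundaries.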

Note that you'd still need a reducer to handle the edge cases where related keys were handled by different mappers. Since keys in HBase are ordered on disk, you'd only get that at the end or beginning of a split. You can reduce the risk of this happening by pre-splitting the table.

1
votes

Looking into the implementation, I saw that calling the map step with a single scan resulted in exactly one mapper being used. This is why the input set was not split at all.

When a list of scans is passed to the TableMapReduceUtil.initTableMapperJob function, the input set is split at each scan. In this way one can define the number of mappers used in the MapReduce job.

Another way would be to extend the TableInputFormat class and override its getSplits method.

As Arnon Rotem-Gal-Oz correctly said, one can only access one row at a time within the mapper's map function.
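As a rough illustration of how the boundaries for such a list of scans could be computed, here is a minimal sketch in plain Java. It assumes row keys are spread evenly over their first byte (for example, hashed or salted keys), which may not hold for your table. RowKeyRanges and split are hypothetical names. Each returned pair would become the start and stop row of one Scan, and the resulting list would be handed to the list-of-scans overload in TableMapReduceUtil.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts the row key space into n contiguous ranges by first byte. */
final class RowKeyRanges {
    /**
     * Returns n pairs {start, stop}, where start is inclusive and stop is exclusive.
     * An empty array means "open-ended", matching how HBase treats empty start/stop rows.
     * Assumption: keys are evenly distributed over the first byte.
     */
    static List<byte[][]> split(int n) {
        if (n < 1 || n > 256) {
            throw new IllegalArgumentException("n must be between 1 and 256");
        }
        List<byte[][]> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            int lower = i * 256 / n;       // first byte value covered by this range
            int upper = (i + 1) * 256 / n; // first byte value of the next range
            byte[] start = (i == 0) ? new byte[0] : new byte[] {(byte) lower};
            byte[] stop = (i == n - 1) ? new byte[0] : new byte[] {(byte) upper};
            ranges.add(new byte[][] {start, stop});
        }
        return ranges;
    }
}
```

The stop key of each range equals the start key of the next, so every row falls into exactly one range and no row is read twice.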