I'm running a Java MapReduce job against an HBase cluster.
The rowkeys are of the form UUID-yyyymmdd-UUID and groups of rows will have the first UUID (a rowkey prefix) in common. I'll call these rows with the shared prefix a group.
In our HBase cluster, some groups contain a lot more data than others. The size of a group could be in the low thousands of rows, or it could be more than a million.
As I understand it, one region is read by one mapper.
This means a region containing a large group is assigned to a single mapper, and that one mapper is left to process a lot of data.
I have read about and tested setting the hbase.hregion.max.filesize parameter lower so that regions split sooner. This does improve the performance of the MapReduce job, since more mappers are marshaled to process the same data.
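For reference, this is the kind of change I tested in hbase-site.xml (the 1 GB value is just an example; I believe the default in recent HBase versions is 10 GB):

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- example value: 1 GB instead of the default -->
  <value>1073741824</value>
</property>
```

As an aside, I believe the limit can also be set per table (e.g. `alter 'mytable', MAX_FILESIZE => '1073741824'` in the HBase shell), which at least avoids changing it globally, but it still doesn't distinguish between groups within one table.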
However, setting this global max parameter lower can also produce hundreds or thousands of additional regions, which introduces its own overhead and is generally not advised.
Now to my question:
Instead of applying a global max, is it possible to split regions based on the rowkey prefix? That way, once a large group hits a certain size it could spill over into another region, while the smaller groups remain within one region, keeping the overall number of regions as low as possible.
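To make the idea concrete, here is a pure-Java sketch of the split logic I have in mind. This is illustration only: the names (`pickSplitPoints`, `groupPrefix`, the row-count threshold) are mine, and a real implementation would presumably extend HBase's `RegionSplitPolicy` and work from store file sizes rather than an in-memory key list.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixSplitSketch {

    // Extract the group prefix (the first UUID) from a rowkey of the form
    // UUID-yyyymmdd-UUID. A UUID contains four '-' characters, so the group
    // prefix ends at the fifth '-' of the full rowkey.
    static String groupPrefix(String rowkey) {
        int idx = -1;
        for (int i = 0; i < 5; i++) {
            idx = rowkey.indexOf('-', idx + 1);
        }
        return rowkey.substring(0, idx);
    }

    // Given a region's rowkeys in sorted order, pick split points so that no
    // resulting region holds more than maxRowsPerRegion rows. Splits prefer
    // group boundaries; a group is only split internally when it is itself
    // larger than the limit, letting it "spill over" into further regions.
    static List<String> pickSplitPoints(List<String> sortedKeys, int maxRowsPerRegion) {
        // Gather consecutive keys sharing a prefix into ordered groups.
        List<List<String>> groups = new ArrayList<>();
        String prev = null;
        for (String key : sortedKeys) {
            String p = groupPrefix(key);
            if (!p.equals(prev)) {
                groups.add(new ArrayList<>());
                prev = p;
            }
            groups.get(groups.size() - 1).add(key);
        }

        List<String> splits = new ArrayList<>();
        int rowsInRegion = 0;
        for (List<String> group : groups) {
            if (group.size() > maxRowsPerRegion) {
                // Oversized group: start it in a fresh region, then split
                // every maxRowsPerRegion rows so it spills over.
                if (rowsInRegion > 0) {
                    splits.add(group.get(0));
                }
                int off = maxRowsPerRegion;
                while (off < group.size()) {
                    splits.add(group.get(off));
                    off += maxRowsPerRegion;
                }
                rowsInRegion = group.size() % maxRowsPerRegion;
                if (rowsInRegion == 0) rowsInRegion = maxRowsPerRegion;
            } else {
                // Small group: keep it whole, splitting at its first key
                // only when the current region is already too full.
                if (rowsInRegion > 0 && rowsInRegion + group.size() > maxRowsPerRegion) {
                    splits.add(group.get(0));
                    rowsInRegion = 0;
                }
                rowsInRegion += group.size();
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Two small groups (2 rows each) and one large group (5 rows),
        // with a limit of 3 rows per region. UUIDs abbreviated but keeping
        // the four '-' characters the prefix extraction relies on.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2; i++) keys.add("1111-aa-aa-aa-aa-20240101-row" + i);
        for (int i = 0; i < 2; i++) keys.add("2222-bb-bb-bb-bb-20240101-row" + i);
        for (int i = 0; i < 5; i++) keys.add("3333-cc-cc-cc-cc-20240101-row" + i);
        // prints [2222-bb-bb-bb-bb-20240101-row0, 3333-cc-cc-cc-cc-20240101-row0, 3333-cc-cc-cc-cc-20240101-row3]
        System.out.println(pickSplitPoints(keys, 3));
    }
}
```

So the small groups each stay whole, while the large group spills into a second region. For what it's worth, I'm aware HBase ships a KeyPrefixRegionSplitPolicy, but as far as I can tell it only keeps rows with a common fixed-length prefix in the same region; it doesn't trigger splits per group as sketched above.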
Hope this makes sense! Thanks.