I designed the row key of an HBase table as follows:
<clustering-prefix><yyyyMMdd><type><uid>
where the clustering-prefix is a two-digit string that lets me spread the load of a single day across the entire cluster by grouping events into clusters.
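As a minimal sketch of this key layout: the function below builds such a row key, deriving the clustering prefix by hashing the uid into one of 100 buckets (the hashing scheme is my assumption; any deterministic function that spreads uids evenly over 00..99 would work the same way).

```python
import hashlib

def row_key(uid, event_date, event_type, num_buckets=100):
    """Build a row key of the form <clustering-prefix><yyyyMMdd><type><uid>.

    The two-digit clustering prefix is derived here by hashing the uid
    (an assumption -- any deterministic, evenly distributed scheme works).
    """
    bucket = int(hashlib.md5(uid.encode()).hexdigest(), 16) % num_buckets
    return f"{bucket:02d}{event_date}{event_type}{uid}"
```

Because the prefix is deterministic, all events of one uid land in the same bucket, while different uids are spread across 100 key ranges for the same day.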
I also created the table using the SPLITS_FILE clause, with the following split sequence:
002015010
012015010
022015010
032015010
042015010
...
992015010
012015011
022015011
...
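A split file like the one above can be generated mechanically: one split point per (prefix, day) pair, where each point only needs to be a prefix of the full row key. A sketch (full yyyyMMdd dates assumed; HBase sorts the split points lexicographically regardless of file order):

```python
from datetime import date, timedelta

def split_points(start, days, num_buckets=100):
    """Generate pre-split points, one per (clustering prefix, day) pair.

    A split point only needs to be a prefix of the row keys it bounds,
    so <prefix><yyyyMMdd> is enough. Returned in lexicographic order,
    which is the order HBase keeps regions in anyway.
    """
    points = []
    for d in range(days):
        day = (start + timedelta(days=d)).strftime("%Y%m%d")
        for bucket in range(num_buckets):
            points.append(f"{bucket:02d}{day}")
    return sorted(points)
```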
This allows me to send load to 100 regions in parallel, with pretty good performance considering that the data is time-series.
Unfortunately, I noticed that many regions that will be filled in parallel (same day part) are hosted by the same region server. For example:
002015010 -> region_server_01 *
012015010 -> region_server_02
022015010 -> region_server_01 *
032015010 -> region_server_06 **
042015010 -> region_server_03
052015010 -> region_server_01 *
062015010 -> region_server_06 **
Ideally, if I have 100 region servers, I would like to assign regions in such a way that regions with a different clustering prefix (first two digits) are hosted by different region servers as much as possible.
I tried changing the order of the region splits in the SPLITS_FILE, but the behavior didn't change (HBase sorts split points lexicographically, so file order does not affect assignment).
The reason for this kind of row key is related to read and write requirements:
- Write: events of a single day with different clustering prefixes will be written concurrently as they arrive
- Read: after some amount of time, events with the same clustering prefix received in a date range should be processed in batch by a Spark job
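The read pattern above maps directly onto the key layout: for one clustering prefix and a date range, the scan boundaries are just prefix + date. A sketch (the Spark job would issue one such scan per prefix, 100 in total):

```python
def scan_range(prefix, start_day, stop_day):
    """Start/stop row for scanning all events of one clustering prefix
    in the date range [start_day, stop_day).

    Relies on the key layout <prefix><yyyyMMdd><type><uid>: every key
    between the two boundaries shares the prefix and falls in the range.
    """
    return f"{prefix}{start_day}", f"{prefix}{stop_day}"
```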
Question
Is there a way to configure HBase (the AssignmentManager?) to assign different clustering prefixes to different region servers?
It seems that the assignment behavior is random by default.
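As far as I know, the default balancer does not look at row-key semantics, so one workaround is to compute the desired placement yourself and apply it with the hbase shell's `move` command (after disabling the automatic balancer with `balance_switch false`, so it doesn't undo the moves). A sketch of that placement logic; the server names are hypothetical and `ENCODED_NAME_<prefix>` is a placeholder for each region's real encoded name, which you would look up in the master UI:

```python
def round_robin_placement(prefixes, servers):
    """Assign each clustering prefix to a region server round-robin, so
    regions with different prefixes land on different servers as much as
    possible. Returns {prefix: server}."""
    return {p: servers[i % len(servers)]
            for i, p in enumerate(sorted(prefixes))}

def move_commands(placement):
    """Emit hbase shell `move` commands for a placement.

    ENCODED_NAME_<prefix> is a placeholder: substitute the encoded name
    of the region whose start key carries that clustering prefix."""
    return [f"move 'ENCODED_NAME_{p}', '{s}'"
            for p, s in sorted(placement.items())]
```

With 100 prefixes and 100 region servers this yields one prefix per server; with fewer servers, prefixes wrap around evenly.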