I designed the row key of an HBase table as follows:

<clustering-prefix><yyyyMMdd><type><uid>

where the clustering-prefix is a two-digit string that lets me spread the load of a single day across the entire cluster by grouping events into clusters.
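
For illustration, here is a minimal sketch of how such a key could be assembled. It assumes the prefix is derived from a stable hash of the uid modulo 100; the question does not say how the prefix is chosen, so that rule, the `make_row_key` helper and the sample values are all hypothetical.

```python
import zlib
from datetime import date

NUM_BUCKETS = 100  # two-digit prefix, "00" through "99"


def make_row_key(day: date, event_type: str, uid: str) -> str:
    """Build <clustering-prefix><yyyyMMdd><type><uid>.

    Assumption: the prefix comes from a CRC32 of the uid, so the same
    uid always lands in the same bucket. Any stable rule would do.
    """
    bucket = zlib.crc32(uid.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket:02d}{day:%Y%m%d}{event_type}{uid}"


# Hypothetical event: type "click", uid "user42", received on 2015-01-01.
key = make_row_key(date(2015, 1, 1), "click", "user42")
```

The prefix plus the date always takes 10 characters, so the type starts at offset 10 of every key.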

I also created the table using the SPLITS_FILE clause, with the following split sequence:

002015010
012015010
022015010
032015010
042015010
...
992015010
012015011
022015011
...

This lets me send load to 100 regions in parallel, with pretty good performance considering that the data is time-series.

Unfortunately, I noticed that many regions that are going to be filled in parallel (the day part is the same) are hosted by the same region server. That is, for example:

002015010 -> region_server_01 *
012015010 -> region_server_02
022015010 -> region_server_01 *
032015010 -> region_server_06 **
042015010 -> region_server_03
052015010 -> region_server_01 *
062015010 -> region_server_06 **

Ideally, if I have 100 region servers, I would like to assign regions in such a way that regions with a different clustering prefix (first two digits) are hosted by different region servers as much as possible.

I tried changing the order of region splits in SPLITS_FILE but the behavior didn't change.

The reason for this row key design is related to the read and write requirements:

  • Write: events of a single day with different clustering prefixes will be written concurrently as they arrive
  • Read: after some amount of time, events with the same clustering prefix received in a date range should be processed in batch by a Spark job
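
As a sketch of the read requirement: with this key layout, all events for one clustering prefix in a date range sit in one contiguous key range, so the batch job can scan from `<prefix><first day>` up to, but not including, `<prefix><day after last day>` (HBase scans treat the start row as inclusive and the stop row as exclusive by default). The `scan_range` helper below is hypothetical and only computes the two boundary keys.

```python
from datetime import date, timedelta
from typing import Tuple


def scan_range(prefix: str, first_day: date, last_day: date) -> Tuple[str, str]:
    """Return (start_row, stop_row) covering first_day..last_day inclusive."""
    day_after = last_day + timedelta(days=1)
    start_row = f"{prefix}{first_day:%Y%m%d}"
    # The stop row is exclusive, so use the day after the last wanted day.
    stop_row = f"{prefix}{day_after:%Y%m%d}"
    return start_row, stop_row


# Prefix "07", events from 2015-01-30 through 2015-02-02.
start_row, stop_row = scan_range("07", date(2015, 1, 30), date(2015, 2, 2))
```

One such range is needed per clustering prefix, so a job covering every prefix would issue 100 scans.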

Question

Is there a way to configure HBase (AssignmentManager?) to assign different clustering prefixes to different region servers?

It seems that the assignment process is random by default.

1 Answer

Use only the salt when defining the splits:

create 't1', 'f1', {SPLITS => ['01', '02', '03', '04' ... '99']}

Rowkeys starting with 00.... will be assigned to region 1

Rowkeys starting with 01.... will be assigned to region 2

Rowkeys starting with 02.... will be assigned to region 3

And so on. The HBase load balancer will then take care of distributing the regions across your region servers based on load (the number of regions assigned to each server).
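
To make the mapping concrete, here is a small sketch that generates the 99 salt-only split points and looks up which region a row key falls into. It only models the key ranges (a region starts at its split point, inclusive); it does not talk to HBase, and the `split_points` and `region_index` names are made up for the example. Regions are numbered from 0 here, so index 0 is "region 1" above.

```python
from bisect import bisect_right

# Split points '01' .. '99' give 100 regions: [-inf,'01'), ['01','02'), ..., ['99',+inf)
split_points = [f"{i:02d}" for i in range(1, 100)]


def region_index(row_key: str) -> int:
    """Zero-based index of the region whose key range contains row_key."""
    # Keys compare lexicographically, the same order HBase uses for
    # these ASCII row keys.
    return bisect_right(split_points, row_key)


first = region_index("0020150101")   # salt 00 -> index 0
second = region_index("0120150101")  # salt 01 -> index 1
last = region_index("9920150101")    # salt 99 -> index 99
```

Because the date is no longer part of the split points, each salt keeps a single region for all days instead of getting a new region per date prefix.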


BTW, your example split keys have 9 characters; according to what you said, they should have 10 (2 for the salt + 8 for the date).