I designed the row key of an HBase table as follows:
<clustering-prefix><yyyyMMdd><type><uid>
where the clustering-prefix is a two-digit string that lets me spread the load of a single day across the entire cluster by grouping events into clusters.
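As a minimal sketch of this key layout: the function below builds such a row key, deriving the clustering prefix by hashing the uid into one of 100 buckets (the hashing scheme is my assumption; any deterministic function that spreads uids evenly over 00..99 would work the same way).

```python
import hashlib

def row_key(uid, event_date, event_type, num_buckets=100):
    """Build a row key of the form <clustering-prefix><yyyyMMdd><type><uid>.

    The two-digit clustering prefix is derived here by hashing the uid
    (an assumption -- any deterministic, evenly distributed scheme works).
    """
    bucket = int(hashlib.md5(uid.encode()).hexdigest(), 16) % num_buckets
    return f"{bucket:02d}{event_date}{event_type}{uid}"
```

Because the prefix is deterministic, all events of one uid land in the same bucket, while different uids are spread across 100 key ranges for the same day.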
I also created the table using the SPLITS_FILE clause, with the following split sequence:
002015010
012015010
022015010
032015010
042015010
...
992015010
012015011
022015011
...
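A split file like the one above can be generated mechanically: one split point per (prefix, day) pair, where each point only needs to be a prefix of the full row key. A sketch (full yyyyMMdd dates assumed; HBase sorts the split points lexicographically regardless of file order):

```python
from datetime import date, timedelta

def split_points(start, days, num_buckets=100):
    """Generate pre-split points, one per (clustering prefix, day) pair.

    A split point only needs to be a prefix of the row keys it bounds,
    so <prefix><yyyyMMdd> is enough. Returned in lexicographic order,
    which is the order HBase keeps regions in anyway.
    """
    points = []
    for d in range(days):
        day = (start + timedelta(days=d)).strftime("%Y%m%d")
        for bucket in range(num_buckets):
            points.append(f"{bucket:02d}{day}")
    return sorted(points)
```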
This allows me to send load to 100 regions in parallel, with pretty good performance considering that the data is time-series.
Unfortunately, I noticed that many regions that will be filled in parallel (same day part) are hosted by the same region server. For example:
002015010 -> region_server_01 *
012015010 -> region_server_02
022015010 -> region_server_01 *
032015010 -> region_server_06 **
042015010 -> region_server_03
052015010 -> region_server_01 *
062015010 -> region_server_06 **
Ideally, if I have 100 region servers, I would like to assign regions in such a way that regions with a different clustering prefix (first two digits) are hosted by different region servers as much as possible.
I tried changing the order of the region splits in the SPLITS_FILE, but the behavior didn't change (HBase sorts split points lexicographically, so file order does not affect assignment).
The reason for this kind of row key is related to read and write requirements:
- Write: events of a single day with different clustering prefixes will be written concurrently as they arrive
- Read: after some amount of time, events with the same clustering prefix received in a date range should be processed in batch by a Spark job
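The read pattern above maps directly onto the key layout: for one clustering prefix and a date range, the scan boundaries are just prefix + date. A sketch (the Spark job would issue one such scan per prefix, 100 in total):

```python
def scan_range(prefix, start_day, stop_day):
    """Start/stop row for scanning all events of one clustering prefix
    in the date range [start_day, stop_day).

    Relies on the key layout <prefix><yyyyMMdd><type><uid>: every key
    between the two boundaries shares the prefix and falls in the range.
    """
    return f"{prefix}{start_day}", f"{prefix}{stop_day}"
```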
Question
Is there a way to configure HBase (the AssignmentManager?) to assign different clustering prefixes to different region servers?
It seems that the assignment behavior is random by default.
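As far as I know, the default balancer does not look at row-key semantics, so one workaround is to compute the desired placement yourself and apply it with the hbase shell's `move` command (after disabling the automatic balancer with `balance_switch false`, so it doesn't undo the moves). A sketch of that placement logic; the server names are hypothetical and `ENCODED_NAME_<prefix>` is a placeholder for each region's real encoded name, which you would look up in the master UI:

```python
def round_robin_placement(prefixes, servers):
    """Assign each clustering prefix to a region server round-robin, so
    regions with different prefixes land on different servers as much as
    possible. Returns {prefix: server}."""
    return {p: servers[i % len(servers)]
            for i, p in enumerate(sorted(prefixes))}

def move_commands(placement):
    """Emit hbase shell `move` commands for a placement.

    ENCODED_NAME_<prefix> is a placeholder: substitute the encoded name
    of the region whose start key carries that clustering prefix."""
    return [f"move 'ENCODED_NAME_{p}', '{s}'"
            for p, s in sorted(placement.items())]
```

With 100 prefixes and 100 region servers this yields one prefix per server; with fewer servers, prefixes wrap around evenly.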