3
votes

I have a 5-node HBase cluster, and most of my incoming requests fetch sequential data.

To optimize storage, I ran manual region splits on the most heavily loaded regions, but it doesn't help much: the region gets split, but the resulting regions mostly stay on the same region server.

How can I control region splitting so that the regions end up like this:

r-1(k1 to k2) on server s1,
r-2(k2 to k3) on server s2,
r-3(k3 to k4) on server s3,
r-4(k4 to k5) on server s4,
r-5(k5 to k6) on server s5,
r-6(k6 to k7) on server s1,

That is, after splitting, no two consecutive regions go to the same server, so the load does not concentrate on a single server.

1
What makes you think that this is causing problems? The load balancer runs once every 5 minutes by default and moves regions around/splits regions to even out the cluster load. That should be enough. Distribution of the data across the cluster is then taken care of by HDFS. – Hari Menon
Thanks Raze2dust for replying! The only problem with having consecutive regions on the same region server is that requests for sequential data take more time: they exceed the limit of hbase.regionserver.handler.count and some requests go into a waiting state. – Sandeep Jain
Just as an example: after default load balancing, I noticed the regions were distributed like this: r-1 on S4, r-2 on S1, r-3 on S1, r-4 on S2, r-5 on S2, r-6 on S2, r-7 on S3, r-8 on S5, ... Each region now gets almost the same number of requests per second, but a new request fetching data that lies between regions r-4 and r-6 depends entirely on the single server S2. How can I control region distribution so that no consecutive regions go to the same server? Thanks. – Sandeep Jain

1 Answer

0
votes

I am assuming that by server you mean RegionServer. Regions are assigned to region servers randomly, so if your cluster is big enough this situation should not occur (or should occur only rarely). The idea is that you shouldn't need to worry about this. Also, understand that the region server is only a gateway for the data: it relies on HDFS to fetch the actual data, and where the data comes from is decided by HDFS.

Besides, even if consecutive regions end up being served by the same RS, you should be able to use multithreading to get the data faster; HBase already runs a separate thread internally for each region, AFAIK. Usually this doesn't lead to too much load. Have you confirmed that there is actually excessive load because of this? Did you do any profiling to see what is causing it?
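A minimal sketch of the multithreading idea, assuming the older HTable-based client API and a placeholder table name "mytable": split the client-side scan along region boundaries and run one scan per region in parallel (each thread opens its own HTable, since HTable is not thread-safe). Here the per-region "work" is just a row count; replace it with whatever processing you need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScan {
    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // placeholder table name

        // Region boundaries: startKeys[i]..endKeys[i] is the i-th region.
        Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
        byte[][] startKeys = keys.getFirst();
        byte[][] endKeys = keys.getSecond();
        table.close();

        ExecutorService pool = Executors.newFixedThreadPool(startKeys.length);
        List<Future<Long>> counts = new ArrayList<Future<Long>>();

        for (int i = 0; i < startKeys.length; i++) {
            final byte[] start = startKeys[i];
            final byte[] stop = endKeys[i];
            counts.add(pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    // One scan per region; each thread uses its own HTable.
                    HTable t = new HTable(conf, "mytable");
                    ResultScanner scanner = t.getScanner(new Scan(start, stop));
                    long rows = 0;
                    for (Result r : scanner) {
                        rows++;                       // replace with real work
                    }
                    scanner.close();
                    t.close();
                    return rows;
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : counts) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("Rows scanned: " + total);
    }
}
```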

So there should really be no need to do this, but in special cases you can use the HBaseAdmin.move() method to achieve it. You could write some code that walks all the regions of a table using HTable.getRegionLocations(), sorts the regions by start key, and manually (using HBaseAdmin.move()) ensures that no two consecutive regions sit on the same region server. But I strongly doubt that this is actually a problem, and I would advise you to confirm it before going for this approach.
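A rough sketch of that approach, assuming the older HBaseAdmin/HTable API and again using "mytable" as a placeholder table name: walk the regions in start-key order (getRegionLocations() already returns them that way for a single table) and, whenever a region sits on the same region server as the previous one, move it to another server picked round-robin from the live servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class SpreadConsecutiveRegions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTable table = new HTable(conf, "mytable"); // placeholder table name

        // All live region servers in the cluster.
        List<ServerName> servers =
                new ArrayList<ServerName>(admin.getClusterStatus().getServers());

        // Regions of the table, ordered by start key.
        NavigableMap<HRegionInfo, ServerName> locations = table.getRegionLocations();

        ServerName previous = null;
        int next = 0;
        for (Map.Entry<HRegionInfo, ServerName> e : locations.entrySet()) {
            HRegionInfo region = e.getKey();
            ServerName current = e.getValue();

            if (previous != null && previous.equals(current)) {
                // Consecutive regions on the same RS: move this one elsewhere.
                ServerName target = servers.get(next++ % servers.size());
                if (target.equals(current)) {
                    target = servers.get(next++ % servers.size());
                }
                admin.move(region.getEncodedNameAsBytes(),
                           Bytes.toBytes(target.getServerName()));
                current = target;
            }
            previous = current;
        }

        table.close();
        admin.close();
    }
}
```

Note that the periodic load balancer may later move regions again and undo this placement, so you would probably also have to disable or tune the balancer while relying on such a manual layout.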