4
votes

My situation is the following: I have a 20-node Hadoop/HBase cluster with 3 ZooKeepers. I do a lot of processing of data from HBase tables to other HBase tables via MapReduce.

Now, if I create a new table and tell any job to use that table as an output sink, all of its data goes onto the same region server. That wouldn't surprise me if there were only a few regions, but one particular table of mine has about 450 regions, and here is the problem: most of those regions (about 80%) are on the same region server!

I was wondering how HBase distributes the assignment of new regions throughout the cluster, and whether this behaviour is normal/desired or a bug. Unfortunately, I don't know where to start looking for a bug in my code.

The reason I ask is that this makes jobs incredibly slow. Only once the jobs are completely finished does the table get balanced across the cluster, but that does not explain this behaviour. Shouldn't HBase distribute new regions to different servers at the moment of creation?
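For what it's worth, one workaround I have seen suggested is to pre-split the table at creation time, so HBase creates many regions up front instead of starting with one. The sketch below only computes evenly spaced split keys for a hypothetical hex-prefixed row-key scheme (the key scheme and region count are assumptions, not from my actual tables); the resulting array would then be passed to `HBaseAdmin.createTable(desc, splits)`.

```java
// Sketch: generate split keys for pre-splitting a table at creation time.
// Assumes row keys start with a two-character hex prefix (00..ff); adjust
// to your own key space. The keys would be handed to
// HBaseAdmin.createTable(HTableDescriptor, byte[][]).
public class SplitKeys {

    // Return n-1 split points dividing the 00..ff prefix space into n regions.
    static byte[][] hexSplits(int n) {
        byte[][] splits = new byte[n - 1][];
        for (int i = 1; i < n; i++) {
            int boundary = i * 256 / n;  // evenly spaced byte prefix
            splits[i - 1] = String.format("%02x", boundary).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions -> three split points: "40", "80", "c0"
        for (byte[] s : hexSplits(4)) {
            System.out.println(new String(s));
        }
    }
}
```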

Thanks for your input!

2

2 Answers

0
votes

I believe that this is a known issue. Currently HBase distributes regions across the cluster as a whole without regard for which table they belong to.

Consult the HBase book for background: http://hbase.apache.org/book/regions.arch.html

It could be that you are on an older version of HBase: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/19155

See the following for a discussion of load balancing and region moving http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/12549
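In the meantime, you can trigger balancing or relocate regions by hand from the HBase shell. A rough illustration (the region and server names are placeholders you would read off the master UI):

```
hbase> balance_switch true        # make sure the balancer is enabled
hbase> balancer                   # ask the master to run a balance pass now
hbase> move 'ENCODED_REGION_NAME', 'HOST,PORT,STARTCODE'   # move one region manually
```

This doesn't fix the underlying assignment policy, but it lets you spread an existing table's regions out without waiting for the jobs to finish.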

0
votes

By default, the balancer only balances the total number of regions on each region server, without taking the table into account.

You can set hbase.master.loadbalance.bytable to true to make it balance regions per table instead.
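Assuming your HBase version supports this property, the setting goes in hbase-site.xml on the master (followed by a master restart):

```xml
<!-- Balance regions per table rather than across the cluster as a whole -->
<property>
  <name>hbase.master.loadbalance.bytable</name>
  <value>true</value>
</property>
```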