My situation is the following: I have a 20-node Hadoop/HBase cluster with 3 ZooKeepers. I do a lot of processing of data from HBase tables to other HBase tables via MapReduce.
Now, if I create a new table and tell any job to use that table as an output sink, all of its data goes to the same regionserver. That wouldn't surprise me if the table had only a few regions, but here is the problem: one particular table of mine has about 450 regions, and most of them (about 80%) sit on the same regionserver!
I was wondering how HBase distributes the assignment of new regions across the cluster, and whether this behaviour is normal/desired or a bug. Unfortunately, I don't know where to start looking for a bug in my own code.
The reason I ask is that this makes jobs incredibly slow. The table only gets balanced across the cluster once the jobs have completely finished, and that doesn't explain the behaviour during the jobs. Shouldn't HBase distribute new regions to different servers at the moment they are created?
Thanks for your input!