2
votes

I have read several articles about how HBase achieves data locality, e.g. link or the book HBase: The Definitive Guide.

I understand that when an HFile is rewritten, Hadoop writes the blocks on the same machine, namely the Region Server that ran the compaction and created the larger file on Hadoop. Everything is clear so far.

Questions:

  1. Assume a Region Server has a region file (HFile) that is split on Hadoop into multiple blocks, e.g. A, B, C. Does that mean all blocks (A, B, C) would be written to the same Region Server?

  2. What happens if the HFile after compaction has 10 blocks (a huge file), but the Region Server doesn't have storage for all of them? Does that mean we lose data locality, since those blocks would be written to another machine?

Thanks for the help.


1 Answer

1
votes

HBase uses the HDFS API to write data to the distributed file system (HDFS). I know this may add to your doubts about data locality. When a client writes data to HDFS through the HDFS API, HDFS writes the first copy of the data to the local DataNode (if the client runs on one) and then replicates it to other nodes. Now to your questions:

  1. Yes. The HFile blocks written by a specific RegionServer (RS) reside on the local DataNode until the region is moved by the HMaster for load balancing or recovery (locality comes back on the next major compaction). So blocks A, B, C would all be on the same machine as the Region Server.

  2. Yes, this can happen. But we can limit it by configuring the start and end keys of each region when the HBase table is created (pre-splitting), which lets the data be distributed evenly across the cluster.
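
To make both answers concrete, here is a small toy simulation in Python. It is only a sketch and not HDFS code: the `DataNode` class, the function names, the node names `rs1`/`rs2`/`rs3`, and the capacities are all made up for illustration. It models just the first replica of each block, and the fallback rule (pick the remote node with the most free space) is my simplifying assumption, not the real HDFS placement policy.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Hypothetical DataNode; capacity is counted in whole blocks."""
    name: str
    capacity_blocks: int
    blocks: list = field(default_factory=list)

    def free(self) -> int:
        return self.capacity_blocks - len(self.blocks)


def place_first_replicas(block_ids, writer, nodes):
    """Place the first replica of each block, preferring the writer's node.

    Assumption: when the local node is full, fall back to the remote node
    with the most free space (a stand-in for the real HDFS policy).
    """
    by_name = {node.name: node for node in nodes}
    local = by_name[writer]
    placement = {}
    for block_id in block_ids:
        if local.free() > 0:
            target = local
        else:
            remote = [n for n in nodes if n.name != writer and n.free() > 0]
            if not remote:
                raise RuntimeError("no DataNode has room for " + block_id)
            target = max(remote, key=lambda n: n.free())
        target.blocks.append(block_id)
        placement[block_id] = target.name
    return placement


def locality(placement, writer):
    """Fraction of blocks whose first replica sits on the writer's node."""
    local_count = sum(1 for node in placement.values() if node == writer)
    return local_count / len(placement)


# Question 1: three blocks and enough local space, so all stay on rs1.
cluster = [DataNode("rs1", 6), DataNode("rs2", 10), DataNode("rs3", 10)]
small = place_first_replicas(["A", "B", "C"], "rs1", cluster)
print(small, locality(small, "rs1"))

# Question 2: ten blocks but room for only six locally, so four go remote.
cluster = [DataNode("rs1", 6), DataNode("rs2", 10), DataNode("rs3", 10)]
big = place_first_replicas([f"blk_{i}" for i in range(10)], "rs1", cluster)
print(big, locality(big, "rs1"))
```

In the second run only the blocks that fit on `rs1` stay local, and the rest land on other nodes, which is the partial loss of locality you asked about. Reads of those blocks go over the network until the data is rewritten somewhere with enough local space.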

Hope this helps.