
We use MapReduce to bulk-create HFiles that are then incrementally (bulk) loaded into HBase. Something I have noticed is that the load step is simply an HDFS move (which does not physically move the blocks of the files).
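For reference, our load step is roughly the sketch below (written against the HBase 1.x client API; the table name and the HFile directory are placeholders). As far as I can tell, doBulkLoad just assigns the prepared files to regions and moves/renames them within HDFS, without copying any blocks:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadStep {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("my_table");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
          // Assigns the prepared HFiles to regions by moving/renaming them within
          // HDFS; no blocks are copied, so locality stays whatever it already was.
          new LoadIncrementalHFiles(conf)
              .doBulkLoad(new Path("/tmp/hfiles"), admin, table, locator);
        }
      }
    }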

Since we do a lot of HBase table scans and have short-circuit reading enabled, it would be beneficial to have these HFiles localized to their respective regions' nodes.
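(For context, this is roughly how I sanity-check that the client side sees the usual short-circuit read settings; it assumes hdfs-site.xml is on the classpath, and the socket path mentioned in the comment is just an example value, not necessarily yours.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ShortCircuitCheck {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Standard HDFS property enabling short-circuit (local) reads.
        System.out.println("dfs.client.read.shortcircuit = "
            + conf.getBoolean("dfs.client.read.shortcircuit", false));
        // Short-circuit reads also need a shared domain socket,
        // e.g. /var/lib/hadoop-hdfs/dn_socket on the DataNodes.
        System.out.println("dfs.domain.socket.path = "
            + conf.get("dfs.domain.socket.path", "(unset)"));
      }
    }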

I know that a major compaction can accomplish this, but it is inefficient when the HFiles are small compared to the region size.
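What we do today when we need locality back is trigger a major compaction through the Admin API, roughly like this (the table name is a placeholder; the call only requests the compaction asynchronously):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class MajorCompact {
      public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          // Rewrites every HFile of each region on that region's RegionServer,
          // which restores locality but is wasteful for small, freshly loaded files.
          admin.majorCompact(TableName.valueOf("my_table"));
        }
      }
    }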

Have you looked at the locality index of your RegionServers? What is the average localityIndex? – Anil Gupta
Yes, and of course after a major compaction it goes to 1, and even after an HFile load it might only drop to 89, depending on the size of the HFile. However, it still seems like it would be possible to execute a command that would ensure data locality. – Andrew White

1 Answer


HBase uses HDFS as its file system; HBase does not control the data locality of HDFS blocks.
When the HBase API is used to write data, the RegionServer becomes an HDFS client, and in HDFS, if the client node is also a DataNode, a local replica of each block is created. Hence, the locality index is high when the HBase API is used for writes.

When bulk load is used, the HFiles are already present in HDFS, so HBase simply makes those HFiles part of the regions. In this case data locality is not guaranteed.

If you really need high data locality, then rather than bulk loading I would recommend using the HBase API for writes.
I have been using the HBase API to write to HBase from my MR jobs, and it has worked well so far.
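For example, a map-only MR job that writes through the HBase client API (TableOutputFormat) could look roughly like the sketch below; the table name "my_table", the column family "cf", and the CSV input are placeholders for whatever your job actually uses:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class PutWriteJob {

      // Parses "rowkey,value" lines and emits Puts. The RegionServer owning each
      // row writes it to HDFS itself, so one replica lands on its local DataNode.
      public static class PutMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split(",", 2);
          Put put = new Put(Bytes.toBytes(fields[0]));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1]));
          context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "put-write");
        job.setJarByClass(PutWriteJob.class);
        job.setMapperClass(PutMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Wires the output side to TableOutputFormat; a null reducer plus zero
        // reduce tasks makes this a map-only job writing through the HBase API.
        TableMapReduceUtil.initTableReducerJob("my_table", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The trade-off is that these writes go through the normal RegionServer write path (WAL and memstore), so they are slower than a bulk load, but the blocks flushed by each RegionServer get a local replica, which keeps the locality index high.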