We have an HBase-based system into which we would like to bulk load a few million rows daily in production. We think HBase bulk load will be a better option than individual Puts: the bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Bulk loading uses less CPU and network resources than writing through the HBase API. We have evaluated this and it works fine. However, the following section in the Reference Guide describes a limitation:
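For context, a typical bulk-load pipeline looks like the sketch below (using the stock ImportTsv tool; the table name, column names, and HDFS paths are placeholders for illustration):

```shell
# Step 1: run a MapReduce job that writes HFiles in HBase's internal
# format instead of doing live Puts against the region servers.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=hdfs:///tmp/bulk-hfiles \
  my_table hdfs:///tmp/input.tsv

# Step 2: hand the generated StoreFiles to the running cluster; the
# region servers adopt them directly, bypassing the normal write path.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs:///tmp/bulk-hfiles my_table
```

Because step 2 bypasses the write path, the WAL is never written, which is exactly why the replication limitation quoted below exists.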

72.2. Bulk Load Limitations

As bulk loading bypasses the write path, the WAL doesn't get written to as part of the process. Replication works by reading the WAL files, so it won't see the bulk loaded data – and the same goes for the edits that use Put.setDurability(SKIP_WAL). One way to handle that is to ship the raw files or the HFiles to the other cluster and do the other processing there.

This is a big problem for us since we rely on replication for high availability. We also found the JIRA HBASE-13153, which suggests that replication of bulk-loaded data works after that fix.

Questions:

  1. Is Bulk Load intended for production use?
  2. Is the HBase documentation out of date and the limitation is resolved now?
  3. Are there any other limitations of using Bulk Load? If yes, what is the preferred approach?
1 Answer

  1. Yes. Many users use bulk load in production.
  2. Yes, after HBASE-13153 that limitation no longer applies, but see the Release Note of the issue: the feature is off by default, so you will have to enable it via configuration. Once enabled, the bulk loaded files also get replicated to the peer cluster(s). The documentation is indeed out of date and will get fixed soon. Also check the fix versions on the JIRA and pick your HBase version accordingly.
  3. Other limitations: if you use security (ACLs), note that bulk load requires CREATE permission on the table/column family, not just WRITE permission. This is not a limitation as such, just something to keep in mind. There were some defects in bulk load, but in 1.3+ versions these should already be fixed.
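To make point 2 concrete, enabling replication of bulk-loaded data is a configuration change on the clusters involved. A sketch of the relevant hbase-site.xml entries (the cluster id value is a placeholder you choose yourself):

```xml
<!-- hbase-site.xml on the source and peer clusters -->
<property>
  <!-- off by default; turns on replication of bulk loaded HFiles -->
  <name>hbase.replication.bulkload.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- a unique id per cluster, needed by bulk load replication -->
  <name>hbase.replication.cluster.id</name>
  <value>source-cluster-1</value>
</property>
```

On the ACL side from point 3, granting the needed permission in the HBase shell would look like `grant 'etl_user', 'C', 'my_table'` (CREATE on the table), where the user and table names are again placeholders.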

Yes, you can certainly go with the bulk load way of writing data; it seems a perfect match for your use case.