We have an HBase-based system into which we would like to bulk load a few million rows a day in production. We think HBase bulk load will be a better fit than individual Puts: the bulk load feature uses a MapReduce job to output table data in HBase's internal data format (HFiles) and then loads the generated StoreFiles directly into a running cluster, so it uses less CPU and network than going through the HBase API. We have evaluated this and it works fine. However, the following section of the Reference Guide describes a limitation:
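For context, our daily job looks roughly like this (a minimal sketch against the HBase 1.x client API; the table name, paths, and the mapper are placeholders for our own):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path hfileDir = new Path(args[0]); // HFile output dir, e.g. /bulkload/2016-01-01

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"));
         RegionLocator locator = conn.getRegionLocator(table.getName())) {

      // Step 1: MapReduce job writes HFiles partitioned and sorted to
      // match the table's current region boundaries.
      Job job = Job.getInstance(conf, "daily-bulk-load");
      job.setJarByClass(DailyBulkLoad.class);
      // ... set your own mapper here, emitting (ImmutableBytesWritable, Put) pairs ...
      FileOutputFormat.setOutputPath(job, hfileDir);
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Step 2: move the generated StoreFiles directly into the running regions.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(hfileDir, conn.getAdmin(), table, locator);
    }
  }
}
```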
> **72.2. Bulk Load Limitations**
>
> As bulk loading bypasses the write path, the WAL doesn't get written to as part of the process. Replication works by reading the WAL files, so it won't see the bulk-loaded data, and the same goes for edits that use `Put.setDurability(SKIP_WAL)`. One way to handle that is to ship the raw files or the HFiles to the other cluster and do the other processing there.
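If we read that workaround correctly, we would have to repeat the load on the standby cluster ourselves after every run, along these lines (a sketch; NameNode addresses, paths, and the table name are placeholders):

```sh
# 1. Ship the day's HFiles to the standby cluster's HDFS
hadoop distcp hdfs://active-nn:8020/bulkload/2016-01-01 \
              hdfs://standby-nn:8020/bulkload/2016-01-01

# 2. Load them into the same table on the standby cluster
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      hdfs://standby-nn:8020/bulkload/2016-01-01 my_table
```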
This limitation is a big problem for us, since we want high availability. We also found HBASE-13153, which suggests that replication of bulk-loaded HFiles works once that fix is in place.
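If that is correct, enabling it should only require configuration along these lines (a sketch, assuming a release that already includes HBASE-13153; property names are taken from that JIRA, and the cluster id value is a placeholder):

```xml
<!-- hbase-site.xml on the source (and peer) cluster -->
<property>
  <name>hbase.replication.bulkload.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- must be unique per cluster so replicated HFiles can be traced back -->
  <name>hbase.replication.cluster.id</name>
  <value>source-cluster-1</value>
</property>
```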
Questions:
- Is Bulk Load intended for production use?
- Is the HBase documentation out of date, and is this limitation now resolved?
- Are there any other limitations to using Bulk Load? If so, what is the preferred approach?