
Here's what I'm trying to do:

Load data from Hive into HBase serialized by protocol buffers.

I've tried multiple ways:

  1. Create connections directly to HBase and do Puts into HBase. This works, but it is apparently not very efficient (a rough sketch follows this list).

  2. Export the JSON table from Hive to S3 as tab-separated text files, then use the importTsv utility to generate HFiles and bulk-load them into HBase. This also works.
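
A rough sketch of option 1 (the direct-Put approach), assuming an HBase 1.x client, string row keys, and placeholder names (table `my_table`, column family `cf`, qualifier `pb`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;

public class DirectPuts {
    // rows: pairs of (rowKey, protobuf-serialized value)
    public static void write(JavaPairRDD<String, byte[]> rows) {
        rows.foreachPartition(part -> {
            Configuration conf = HBaseConfiguration.create();
            // One connection per partition; each row becomes a synchronous Put,
            // which is what makes this approach slow for large loads.
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {
                while (part.hasNext()) {
                    scala.Tuple2<String, byte[]> row = part.next();
                    Put put = new Put(Bytes.toBytes(row._1));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("pb"), row._2);
                    table.put(put);
                }
            }
        });
    }
}
```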

But now I want to achieve this in an even more efficient way:

Export my data from the Hive table in S3, serialize it into protocol buffers objects, then generate HFiles and load them directly into HBase.

I'm using a Spark job to read from Hive, which gives me a JavaRDD from which I can build my protocol buffers objects, but I'm at a loss as to how to proceed from there.

So my question: how can I generate HFiles from protocol buffers objects? We don't want to save them as text files on local disk or HDFS, so how can I generate HFiles directly from there?

Thanks a lot!

"using Spark job ... do Puts into HBase ... not very efficient" >> do you use the async HBase interface, with one BufferedMutator per executor? Cf. hbase.apache.org/book.html#_basic_spark (note that HBaseContext requires either HBase 2.x issues.apache.org/jira/browse/HBASE-13992 or a CDH version of HBase 1.x because the Apache back-port has not been released yet issues.apache.org/jira/browse/HBASE-14160)Samson Scharfrichter
A Google query about spark hfileoutputformat points to several interesting posts, including "Efficient bulk load of HBase using Spark" opencore.com/blog/2016/10/… – Samson Scharfrichter
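
For reference, a minimal sketch of the BufferedMutator approach suggested in the first comment, with the same placeholder table and column names as above; the mutator buffers mutations client-side and flushes them in batches instead of issuing one RPC per Put:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;

public class BufferedPuts {
    public static void write(JavaPairRDD<String, byte[]> rows) {
        rows.foreachPartition(part -> {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 // BufferedMutator batches mutations and flushes them in bulk
                 // (when the buffer fills up and on close()).
                 BufferedMutator mutator =
                         conn.getBufferedMutator(TableName.valueOf("my_table"))) {
                while (part.hasNext()) {
                    scala.Tuple2<String, byte[]> row = part.next();
                    Put put = new Put(Bytes.toBytes(row._1));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("pb"), row._2);
                    mutator.mutate(put);
                }
            }
        });
    }
}
```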

1 Answer


Thanks to @Samson for pointing to that awesome post.

After trial and error, I got things working. Just to save others the pain, here's the working example.

What it does: it uses Spark to read data from S3, repartitions it to match the target HBase regions, and generates HFiles.
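
The original code isn't reproduced here, so the following is only a minimal sketch of that approach, assuming HBase 1.x client APIs, string row keys, and placeholder names (table `my_table`, column family `cf`, qualifier `pb` holding the protobuf-serialized bytes). For simplicity it sorts globally with sortByKey instead of repartitioning by region boundaries; LoadIncrementalHFiles will split any HFile that happens to cross a region boundary.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;

public class BulkLoadSketch {
    // rows: pairs of (rowKey, protobuf-serialized value), e.g. built from the Hive data in S3
    public static void run(JavaPairRDD<String, byte[]> rows, String hfileDir) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("my_table"));
             Admin admin = conn.getAdmin()) {

            // Pick up compression, block size, bloom filter settings, etc. from the
            // target table and learn its region boundaries.
            Job job = Job.getInstance(conf);
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            // HFiles must be written in row-key order. Sorting on the String key here
            // keeps non-serializable HBase Writables out of the shuffle; the KeyValues
            // are only created afterwards.
            JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = rows
                    .sortByKey(true)
                    .mapToPair(kv -> {
                        byte[] rowKey = Bytes.toBytes(kv._1);
                        KeyValue cell = new KeyValue(rowKey,
                                Bytes.toBytes("cf"), Bytes.toBytes("pb"), kv._2);
                        return new scala.Tuple2<>(new ImmutableBytesWritable(rowKey), cell);
                    });

            // Write HFiles straight to HDFS/S3 -- no intermediate text files.
            cells.saveAsNewAPIHadoopFile(
                    hfileDir,
                    ImmutableBytesWritable.class,
                    KeyValue.class,
                    HFileOutputFormat2.class,
                    job.getConfiguration());

            // Hand the generated HFiles over to the region servers.
            new LoadIncrementalHFiles(job.getConfiguration())
                    .doBulkLoad(new Path(hfileDir), admin, table, locator);
        }
    }
}
```

Creating the KeyValues only after the sort is a deliberate choice: ImmutableBytesWritable and KeyValue are not Java-serializable, so they should not cross a shuffle boundary.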