Here's what I'm trying to do:
Load data from Hive into HBase serialized by protocol buffers.
I've tried multiple ways:
Create connections directly to HBase and do Puts. This works, but it is apparently not very efficient.
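A slightly better variant of this direct-write path, sketched below under assumptions (table name `my_table`, column family `cf`, qualifier `pb` are placeholders), opens one connection and one `BufferedMutator` per partition instead of issuing Puts one at a time:

```java
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;

public class DirectPuts {
  // rows: pairs of (row key, serialized value), here as a simple byte[][] holder
  public static void write(JavaRDD<byte[][]> rows) {
    rows.foreachPartition((Iterator<byte[][]> it) -> {
      // One connection + mutator per partition, created on the executor
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("my_table"))) {
        while (it.hasNext()) {
          byte[][] kv = it.next();          // kv[0] = row key, kv[1] = cell value
          Put put = new Put(kv[0]);
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("pb"), kv[1]);
          mutator.mutate(put);              // buffered client-side; flushed on close()
        }
      }
    });
  }
}
```

This still goes through the region servers' write path, so it only reduces per-Put overhead; it does not match bulk-load performance.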
I exported the table from Hive to S3 as tab-separated text files, then used the ImportTsv utility to generate HFiles and bulk-load them into HBase. This also works.
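For reference, that ImportTsv path looks roughly like this (paths, column names, and the table name are placeholders):

```shell
# 1) Generate HFiles instead of writing through the region servers:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=hdfs:///tmp/hfiles \
  my_table hdfs:///tmp/input_tsv

# 2) Move the generated HFiles into the table's regions:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs:///tmp/hfiles my_table
```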
But now I want to achieve this in an even more efficient way:
Export my data from the Hive table in S3, serialize it into protocol buffers objects, then generate HFiles and mount them directly onto HBase.
I'm using a Spark job to read from Hive, which gives me a JavaRDD; from there I can build my protocol buffers objects, but I'm at a loss as to how to proceed.
So my question: how can I generate HFiles directly from protocol buffers objects? We don't want to save them as text files on local disk or HDFS first.
Thanks a lot!
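One possible shape for this, sketched under assumptions (a protobuf class `MyProto` with a `getId()` key field, table `my_table`, family `cf`; exact signatures vary between HBase 1.x and 2.x): map each protobuf object to a `(row key, KeyValue)` pair with the serialized message as the cell value, sort by row key, write the RDD through `HFileOutputFormat2`, then bulk-load the result:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class ProtoBulkLoad {
  public static void bulkLoad(JavaRDD<MyProto> protos, String hfilePath) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("my_table");

    // Serialize each protobuf message into a cell; HFileOutputFormat2 requires
    // the output to be in total row-key order, hence the sortByKey().
    JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = protos
        .mapToPair(p -> {
          byte[] row = Bytes.toBytes(p.getId());   // row key taken from the proto (assumption)
          byte[] value = p.toByteArray();          // protobuf wire format as the cell value
          KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("pb"), value);
          return new Tuple2<>(new ImmutableBytesWritable(row), kv);
        })
        .sortByKey();

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name)) {
      // Configure compression, bloom filters, and region split points from the live table
      Job job = Job.getInstance(conf);
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      // Write HFiles straight from the RDD -- no intermediate text files
      cells.saveAsNewAPIHadoopFile(hfilePath, ImmutableBytesWritable.class, KeyValue.class,
          HFileOutputFormat2.class, job.getConfiguration());

      // Atomically move the HFiles into the table's regions
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(hfilePath), conn.getAdmin(), table, locator);
    }
  }
}
```

One practical caveat: `ImmutableBytesWritable` and `KeyValue` are not Java-serializable, so Spark typically needs Kryo registration for them, and cells within a row must also be ordered by family/qualifier if you emit more than one per row.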
Comments:

- A `BufferedMutator` per executor? Cf. hbase.apache.org/book.html#_basic_spark (note that `HBaseContext` requires either HBase 2.x, issues.apache.org/jira/browse/HBASE-13992, or a CDH version of HBase 1.x, because the Apache back-port has not been released yet: issues.apache.org/jira/browse/HBASE-14160) – Samson Scharfrichter
- Searching for `spark hfileoutputformat` points to several interesting posts, including "Efficient bulk load of HBase using Spark": opencore.com/blog/2016/10/… – Samson Scharfrichter