As also mentioned in Which HBase connector for Spark 2.0 should I use?, there are mainly two options:
- RDD-based: https://github.com/apache/hbase/tree/master/hbase-spark
- DataFrame-based: https://github.com/hortonworks-spark/shc
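
For context, here is a rough sketch of what I understand a batch write looks like with each connector. The table name `table1`, column family `cf1`, and the SHC catalog are placeholders, not my actual schema:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

case class Record(key: String, col1: String)

val spark = SparkSession.builder().appName("hbase-writes").getOrCreate()
import spark.implicits._

val records = Seq(Record("row1", "a"), Record("row2", "b"))

// Option 1: RDD-based hbase-spark module -- bulkPut sends Puts per partition.
val hbaseConf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(spark.sparkContext, hbaseConf)
hbaseContext.bulkPut[Record](
  spark.sparkContext.parallelize(records),
  TableName.valueOf("table1"),
  (r: Record) => {
    val put = new Put(Bytes.toBytes(r.key))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes(r.col1))
    put
  }
)

// Option 2: DataFrame-based SHC -- the catalog maps DataFrame columns to HBase cells.
val catalog =
  """{
    |"table":{"namespace":"default", "name":"table1"},
    |"rowkey":"key",
    |"columns":{
    |"key":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
    |}
    |}""".stripMargin

records.toDF().write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```
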
I understand the optimizations and the differences between the two with regard to READING from HBase.
However, it's not clear to me which one I should use for BATCH inserting into HBase.
I am not interested in writing records one by one, but in high throughput.
After digging through the code, it seems that both ultimately resort to TableOutputFormat (http://hbase.apache.org/1.2/book.html#arch.bulk.load); a sketch of that write path follows below.
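
For reference, writing through TableOutputFormat directly looks roughly like this (a minimal sketch; `table1`, `cf1`, and `col1` are placeholders), which, as far as I can tell, goes through the normal region-server write path rather than writing HFiles:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

def writeViaTableOutputFormat(sc: SparkContext): Unit = {
  val conf = HBaseConfiguration.create()
  conf.set(TableOutputFormat.OUTPUT_TABLE, "table1")
  val job = Job.getInstance(conf)
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  sc.parallelize(Seq(("row1", "a"), ("row2", "b")))
    .map { case (key, value) =>
      // Each Put goes through the region servers (WAL + memstore).
      val put = new Put(Bytes.toBytes(key))
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(), put)
    }
    .saveAsNewAPIHadoopDataset(job.getConfiguration)
}
```
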
The project uses Scala 2.11, Spark 2, and HBase 1.2.
Does the DataFrame library provide any performance improvements over the RDD library, specifically for BULK LOAD?
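
To clarify what I mean by BULK LOAD: generating HFiles and handing them to the region servers, bypassing the WAL/memstore write path. A rough sketch of how I expect that to look with the hbase-spark module (assuming its HBaseContext.bulkLoad API; the staging directory and schema are placeholders):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

def bulkLoadExample(sc: SparkContext): Unit = {
  val conf = HBaseConfiguration.create()
  val hbaseContext = new HBaseContext(sc, conf)
  val tableName = TableName.valueOf("table1")
  val stagingDir = "/tmp/hbase-staging" // placeholder HDFS path

  // Step 1: write sorted HFiles to the staging directory.
  hbaseContext.bulkLoad[(String, String)](
    sc.parallelize(Seq(("row1", "a"), ("row2", "b"))),
    tableName,
    { case (key, value) =>
      val cell = new KeyFamilyQualifier(
        Bytes.toBytes(key), Bytes.toBytes("cf1"), Bytes.toBytes("col1"))
      Iterator((cell, Bytes.toBytes(value)))
    },
    stagingDir
  )

  // Step 2: hand the generated HFiles over to the region servers.
  val conn = ConnectionFactory.createConnection(conf)
  try {
    new LoadIncrementalHFiles(conf).doBulkLoad(
      new Path(stagingDir),
      conn.getAdmin,
      conn.getTable(tableName),
      conn.getRegionLocator(tableName))
  } finally {
    conn.close()
  }
}
```
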