I recently faced the problem of migrating data from Hive to HBase. Our project uses Spark on a CDH 5.5.1 cluster (7 nodes running SUSE Linux Enterprise, each with 48 cores and 256 GB of RAM, Hadoop 2.6). As a beginner, I thought it would be a good idea to use Spark to load the table data from Hive. I am using the correct Hive column / HBase column family and column mapping to insert the data into HBase.
I found some solutions for bulk-inserting data into HBase, such as hbaseContext.bulkPut or rdd.saveAsHadoopDataset (I tested both, with similar results).
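Roughly, the two write paths looked like this (a minimal sketch in Scala; the table name, column family, field names and the (String, String, String) row type are illustrative, and the HBaseContext signature is the one from the cloudera-labs SparkOnHBase project shipped with CDH, which differs slightly from the later hbase-spark module):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import com.cloudera.spark.hbase.HBaseContext

// Build one Put per record: row key from _1, two columns in family "cf"
def toPut(row: (String, String, String)): Put = {
  val put = new Put(Bytes.toBytes(row._1))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field1"), Bytes.toBytes(row._2))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field2"), Bytes.toBytes(row._3))
  put
}

// Variant 1: HBaseContext.bulkPut; autoFlush = false lets the client
// buffer Puts per partition instead of doing one round trip per record
def writeWithBulkPut(sc: SparkContext, rdd: RDD[(String, String, String)]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.bulkPut[(String, String, String)](rdd, "hbase_table", toPut, autoFlush = false)
}

// Variant 2: saveAsHadoopDataset through the old-API TableOutputFormat
def writeWithHadoopDataset(rdd: RDD[(String, String, String)]): Unit = {
  val jobConf = new JobConf(HBaseConfiguration.create())
  jobConf.setOutputFormat(classOf[TableOutputFormat])
  jobConf.set(TableOutputFormat.OUTPUT_TABLE, "hbase_table")
  rdd.map(row => (new ImmutableBytesWritable, toPut(row))).saveAsHadoopDataset(jobConf)
}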
The result was a working program, but the job was far too slow (around 10 minutes per GB, degrading to 1 hour for 3 GB), and my RegionServers' memory/heap usage was far too high (they could simply crash, depending on the configuration I set).
After tuning the RegionServer and HBase configuration again and again, I tried the plain Hive way instead, i.e. creating an external table using the HBase storage handler as an entry point to HBase, and loading it with:
INSERT OVERWRITE TABLE entry_point
SELECT named_struct('rk_field1', rk_field1, 'rk_field2', rk_field2)
, field1
, field2
FROM hive_table
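For completeness, the entry_point table was declared along these lines (a sketch: the column types, the column family cf and the target HBase table name are assumptions, and the struct row key relies on Hive's composite-key support, which concatenates the struct fields using the table's collection delimiter):

CREATE EXTERNAL TABLE entry_point (
  key struct<rk_field1:string, rk_field2:string>,
  field1 string,
  field2 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:field1,cf:field2')
TBLPROPERTIES ('hbase.table.name' = 'hbase_table');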
It went really well, inserting 22 GB of data into HBase in 10 minutes. My question is: why is it so much better that way? Is it a configuration problem? Why would this be such a bad use case for Spark?
Edit: even with this last technique it is still pretty slow (2 hours to insert 150 GB). The only problem I can see via Cloudera Manager is the GC time, averaging 8 seconds but sometimes spiking to 20 seconds, depending on the RegionServer.