I have recently faced a problem migrating data from Hive to HBase. Our project uses Spark on a CDH 5.5.1 cluster (7 nodes running SUSE Linux Enterprise, with 48 cores and 256 GB of RAM each, Hadoop 2.6). As a beginner, I thought it was a good idea to use Spark to load the table data from Hive. I am using the correct Hive column / HBase column family and column mapping to insert the data into HBase.

I found some solutions for bulk inserting data into HBase, such as hbaseContext.bulkPut or rdd.saveAsHadoopDataset (I tested both, with similar results).
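
For context, my saveAsHadoopDataset attempt looked roughly like the following sketch (my_table, the cf column family, and the stand-in input RDD are placeholders, not my real schema):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hive-to-hbase-put"))

    // Point the output format at the target HBase table.
    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    jobConf.setOutputFormat(classOf[TableOutputFormat])

    // Stand-in for the rows read from the Hive table.
    val hiveRdd = sc.parallelize(Seq(("rk1", "a", "b"), ("rk2", "c", "d")))

    // Each row becomes one Put, written through the RegionServers' normal write path.
    hiveRdd.map { case (rowKey, field1, field2) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field1"), Bytes.toBytes(field1))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field2"), Bytes.toBytes(field2))
      (new ImmutableBytesWritable, put)
    }.saveAsHadoopDataset(jobConf)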

The result was a functional program, but the job was far too slow (around 10 minutes/GB, slowing down to 1 hour for 3 GB), and my RegionServers' memory/heap usage was far too high (they could simply crash, depending on the configuration I set).

After modifying the RegionServer and HBase configuration again and again, I tried the simple Hive way, i.e. creating an external table using the HBase storage handler as an entry point into HBase, and loading it with

INSERT OVERWRITE TABLE entry_point 
    SELECT named_struct('rk_field1', rk_field1, 'rk_field2', rk_field2)
    , field1
    , field2 
FROM hive_table
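
For reference, the entry_point table is declared along these lines (a sketch only: the cf column family and the exact column mapping are illustrative, not my real DDL):

    CREATE EXTERNAL TABLE entry_point (
        rk struct<rk_field1:string, rk_field2:string>,
        field1 string,
        field2 string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:field1,cf:field2')
    TBLPROPERTIES ('hbase.table.name' = 'entry_point');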

It went really well, inserting 22 GB of data into HBase in 10 minutes. My question is: why is it so much better that way? Is it a configuration problem? Why would this be such a bad use case for Spark?

Edit: Even with this last technique it is still pretty slow (2 hours to insert 150 GB). The only problem I can see via Cloudera Manager is the GC time, averaging 8 seconds but sometimes spiking to 20 seconds, depending on the RegionServer.

It would be much better if you could share the technology distribution for your use case; that would help answer the question. – Amit Kumar
Done. Thanks for the advice. – Nosk
Are the HBase and Spark worker nodes shared? – Amit Kumar
Every Spark gateway also has the RegionServer role, yes. – Nosk
Both Spark and HBase are RAM hungry; try to segregate them onto different nodes, or make sure enough RAM is available for both if they run on the same nodes. – Amit Kumar

1 Answer

The reason the HBase data load is slow is the put operations. A normal put operation in HBase involves:

  • an entry in the WAL (Write-Ahead Log)
  • MemStore flushes
  • and, eventually, writing the data to HDFS as HFiles.

If you are doing a bulk load into HBase, you should consider doing it through HFileOutputFormat2; it is much faster than regular HBase puts.

We came across the same situation: trying to load 2 TB of data into HBase through puts took around 10 hours, and even after configuring and tuning HBase, the load time only came down to 7-8 hours.

We then decided to load the data as HFiles instead. To achieve this:

  1. First understand your data, then create a table with pre-split regions.
  2. Process the input data set and write the results out in HFile format through a Spark/Map-Reduce job (see the sketch after this list).
  3. Finally, load the data into the HBase table using hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.
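
A minimal Spark (Scala) sketch of steps 1 and 2, assuming the HBase 1.x client API that ships with CDH 5.5; the table name, split points, column family, and stand-in input RDD are all illustrative:

    import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, KeyValue, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hfile-bulk-load"))
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val tableName = TableName.valueOf("my_table")

    // Step 1: create the table with pre-split regions, so the load is
    // spread over the cluster instead of hammering one region at a time.
    val descriptor = new HTableDescriptor(tableName)
    descriptor.addFamily(new HColumnDescriptor("cf"))
    val splitKeys = Array("d", "h", "l", "p", "t").map(Bytes.toBytes(_)) // illustrative split points
    connection.getAdmin.createTable(descriptor, splitKeys)

    // Step 2: write the data out as HFiles; the writer requires rows in row-key order.
    val job = Job.getInstance(conf)
    HFileOutputFormat2.configureIncrementalLoad(job,
      connection.getTable(tableName), connection.getRegionLocator(tableName))

    sc.parallelize(Seq(("rk1", "a"), ("rk2", "b")))   // stand-in for the real input data set
      .sortBy(_._1)                                   // HFiles must be written sorted by row key
      .map { case (rowKey, field1) =>
        val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),
          Bytes.toBytes("field1"), Bytes.toBytes(field1))
        (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
      }
      .saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
        classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

    // Step 3 (shell): hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles my_table

Because the HFiles are written directly to HDFS and only then handed to the RegionServers, this path bypasses the WAL and the MemStore entirely, which is why it is so much lighter on RegionServer heap than puts.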