1 vote

I am copying data from one Hive table to another (external) Hive table in Spark SQL code, for a data volume of 74 million rows (~50 GB). The insert operation is taking more than 40 minutes.

hiveContext.sql("insert overwrite table dev_work.WORK_CUSTOMER select * from dev_warehouse.CUSTOMER")
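One Spark-side knob worth checking is the write parallelism. Below is a hedged sketch, assuming Spark 1.6 (as shipped with CDH 5.8) and the Scala HiveContext from the question; the partition count of 200 is purely illustrative and should be tuned to the cluster:

val df = hiveContext.table("dev_warehouse.CUSTOMER")
// Repartitioning adds a shuffle, but it lets you control the number of
// concurrent write tasks; whether it helps depends on how many input
// splits the source table produces.
df.repartition(200)
  .write
  .mode("overwrite")
  .insertInto("dev_work.WORK_CUSTOMER")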

I have tried other data-copy approaches such as:

  1. hdfs dfs -cp for these external tables (see the metastore note after this list):

hdfs dfs -cp hdfs:/home/dummy/dev_dwh/CUSTOMER hdfs:/home/dummy/dev_work/WORK_CUSTOMER

  2. Export/Import:
export table dev_warehouse.CUSTOMER to 'hdfs_exports_location/customer';
import external table dev_work.WORK_CUSTOMER from 'hdfs_exports_location/customer';
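Note that the raw-file copy in option 1 only moves data; Hive's metastore does not automatically see it. A hedged follow-up, assuming dev_work.WORK_CUSTOMER is a partitioned external table whose LOCATION already points at the destination directory:

MSCK REPAIR TABLE dev_work.WORK_CUSTOMER;

For a non-partitioned external table, pointing the table's LOCATION at the copied directory is enough.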

Cluster details:

CDH 5.8, 19-node cluster

Could you please help tune the performance, or suggest an alternative way to perform a faster data copy?

Thanks, Arvind


1 Answer

0 votes

Try Hadoop DistCp, which is a tool built for large inter-/intra-cluster copying:

http://hadoop.apache.org/docs/r2.7.3/hadoop-distcp/DistCp.html
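A minimal invocation, reusing the question's source and target paths (an intra-cluster copy, so both resolve against the default namenode); the map count of 50 is illustrative:

hadoop distcp -m 50 -overwrite /home/dummy/dev_dwh/CUSTOMER /home/dummy/dev_work/WORK_CUSTOMER

DistCp runs as a MapReduce job, so the copy is spread across the cluster instead of a single client, and -m caps the number of parallel map tasks.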