Hive to RDBMS incremental export

0

votes

I am exploring sqoop to send data from hive to RDBMS. I don't want to send the same data again and again. I need to identify the changes in HDFS and send only the data that has changed since my previous export. What is the best way to implement such incremental export logic? I see that sqoop import has incremental logic option; but can't see it in export.

Any recommendations / suggestions would greatly be appreciated.

hadoophivesqooprdbms

Hive doesn't have a transaction log, therefore there's no way to know what's changed between two export operations. At least, as far as I know. Most RDBMS do have logs, which is why incremental import is possible - OneCricketeer

0

votes

You can create new table or view with the change records in Hive (TABLE_NAME_CHANGED) and use that to import into RDBMS.

0

votes

You can achieve the incremental export below is my explanation on it assuming you have a timestamp field in the hive to identify the delta.

Every time before you export the data you will have to check the maximum timestamp in the RDBMS and use it to create your export file.

##Checking the max date in RDBMS
#You can tweak with the command based on the stack thats produced by your sqoop 
mxdt=$(sqoop eval --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop -password hadoop --query "select max(timestamp_filed) from schema.table" | awk "NR==6{print;exit}" | sed 's/|//g' | sed ''s/[^[:print:]]//g'' | sed 's/ //g')

#Based on the mxdt variable you can create a file from beeline/hive as below
beeline -u ${ConnString} --outputformat=csv2 --showHeader=false --silent=true --nullemptystring=true --incremental=true -e "select * from hiveSchema.hiveTable where timestamp > ${mxdt}" >> /SomeLocalPath/FileName.csv

#Copy file to hdfs

hdfs dfs -put /SomeLocalPath/FileName.csv2 /tmp/

#Now use the file in hdfs to do the sqoop export
sqoop export --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop -password hadoop --export-dir '/tmp/FileName.csv' --table RDBMSSCHEMA.RDBMSTABLE --fields-terminated-by "," --lines-terminated-by "\n" -m 1 --columns "col1,col2,"

Hive to RDBMS incremental export

2 Answers