
Please see my process & issue below:

Step 1 - I want to load one MySQL table's data in an incremental fashion, so first I am using Sqoop to import the incremental data using '--incremental append --check-column <col> --last-value <val>'. Assume this lands in the HDFS folder '/sqoopdir1'.

Step 2 - I want to use a Pig script to do some data validations, like duplicate checking, and store the result in '/user/root/pigdir'.

Step 3 - I want to create a Hive external table 'extsample' pointing to the Pig output directory, i.e. stored as textfile with location '/user/root/pigdir'. A rough sketch of these three steps is shown below.
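For reference, this is roughly what I run at each step (the JDBC URL, table name, check column and column schema below are just placeholders, not my real ones):

# Step 1: incremental Sqoop import into /sqoopdir1
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table mytable \
  --target-dir /sqoopdir1 \
  --incremental append \
  --check-column id \
  --last-value 0

# Step 2: Pig script (dedupe.pig) that drops duplicates and writes to /user/root/pigdir
#   raw  = LOAD '/sqoopdir1' USING PigStorage(',');
#   uniq = DISTINCT raw;
#   STORE uniq INTO '/user/root/pigdir' USING PigStorage(',');
pig dedupe.pig

# Step 3: Hive external table on top of the Pig output
hive -e "CREATE EXTERNAL TABLE extsample (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         STORED AS TEXTFILE
         LOCATION '/user/root/pigdir'"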

My issues here are: (1) Sqoop will create part-m-00001, part-m-00002, etc. in the folder /sqoopdir1 in HDFS, but for each incremental run I just need the latest files for the Pig job to pick up automatically. (2) Another issue is that Pig creates a part-r-00000 file in /user/root/pigdir, and the second time the job fails saying the file already exists, so I have to rename it before I can rerun the Pig job.

In short, I want the initially loaded data plus just the inserted/updated records in the final Hive table. As I am a beginner, could you please help resolve the issues mentioned above or advise a better approach? Thanks!


1 Answer


1) As per my understanding, you are trying to Sqoop each incremental batch into the same folder? But that can't be true; Sqoop will also throw an exception that the folder already exists. The part-m-* files are the output of Sqoop, so if you think about it, all of those "part" files together are the incremental data generated by a single Sqoop ingestion.

We need to handle this in the code: for each incremental Sqoop ingestion we need to provide a dynamic output location (for example, append a timestamp to the folder name). The same goes for Pig. If you are using a shell script to call Sqoop and Pig, passing a dynamic location is very easy, something along the lines of the sketch below.
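Here is a rough, untested sketch. It assumes the same placeholder JDBC URL, table and check column as in your question, that your Pig script is changed to read '$INPUT' and write to '$OUTPUT' (passed via -param), and that you track the last imported value yourself (or use a saved Sqoop job instead):

#!/bin/bash
# Give every run its own Sqoop and Pig output folders, then move the Pig
# output files into the fixed location of the Hive external table.

TS=$(date +%Y%m%d%H%M%S)
SQOOP_DIR=/sqoopdir1/run_${TS}          # fresh folder per Sqoop ingestion
PIG_DIR=/user/root/pigstage/run_${TS}   # fresh folder per Pig run
HIVE_DIR=/user/root/pigdir              # fixed location of table 'extsample'
LAST_VALUE=${1:-0}                      # placeholder: track this yourself or use a saved Sqoop job

# 1) Incremental import into this run's folder
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser -P \
  --table mytable \
  --target-dir ${SQOOP_DIR} \
  --incremental append \
  --check-column id \
  --last-value ${LAST_VALUE}

# 2) Pig never sees an existing output folder, so it will not fail
pig -param INPUT=${SQOOP_DIR} -param OUTPUT=${PIG_DIR} dedupe.pig

# 3) Append this run's result files to the table location under unique names,
#    so the Hive table keeps the initial load plus every increment
for f in $(hdfs dfs -ls ${PIG_DIR}/part-* | awk '{print $NF}' | grep part-); do
  hdfs dfs -mv ${f} ${HIVE_DIR}/$(basename ${f})_${TS}
done

Because each run writes to its own timestamped folders, neither Sqoop nor Pig ever hits an "already exists" error, and the final move step is what accumulates every increment under the Hive table's location.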