Please see my process and issues below.
Step 1 - I want to load data from one MySQL table incrementally. I am using Sqoop with '--incremental append --check-column <col> --last-value <val>'; assume this lands in the HDFS folder '/sqoopdir1'.
Step 2 - I want to run a Pig script that does some data validation (e.g. duplicate checking) and writes its result to '/user/root/pigdir'.
Step 3 - I want to create a Hive external table 'extsample' on top of the Pig output directory, i.e. "stored as textfile location '/user/root/pigdir'".
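Roughly, this is what I am running today (the connection string, table name, check column 'id', and Hive column list are just placeholders, not my real values):

    # Step 1: incremental Sqoop import into /sqoopdir1
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb --username myuser -P \
      --table sample \
      --target-dir /sqoopdir1 \
      --incremental append \
      --check-column id \
      --last-value 0

    # Step 2: Pig script (dedupe.pig) that removes duplicates and stores the result
    #   raw     = LOAD '/sqoopdir1' USING PigStorage(',');
    #   deduped = DISTINCT raw;
    #   STORE deduped INTO '/user/root/pigdir' USING PigStorage(',');
    pig -f dedupe.pig

    # Step 3: Hive external table over the Pig output
    hive -e "CREATE EXTERNAL TABLE extsample (id INT, name STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             STORED AS TEXTFILE
             LOCATION '/user/root/pigdir'"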
My issues are:
(1) Sqoop creates part-m-00001, part-m-00002, etc. in /sqoopdir1 in HDFS, but for each incremental run I want the Pig job to automatically pick up only the latest (newly imported) file, not the whole folder.
(2) Pig writes a part-r-00000 file into /user/root/pigdir. The second time the job runs it fails saying the output already exists, so I have to rename the existing output by hand before I can rerun the Pig job.
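For issue (2), this is roughly the manual workaround I do before each rerun (directory and file names are just examples); I would like to avoid this step entirely:

    # Pig/MapReduce refuses to write into an existing output directory,
    # so I move the previous result aside before each run
    hadoop fs -mkdir -p /user/root/pigdir_archive
    hadoop fs -mv /user/root/pigdir/part-r-00000 /user/root/pigdir_archive/part-r-00000_$(date +%Y%m%d%H%M%S)
    hadoop fs -rm -r /user/root/pigdir      # so the next STORE can recreate it
    pig -f dedupe.pig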
In short, I want the final Hive table to contain the initially loaded data plus the newly inserted/updated records. As I am a beginner, could you please help me resolve the issues mentioned above, or advise a better approach? Thanks!