
Can you please help me with the points below?

  1. I have an Oracle database with a huge number of records today, say 5 TB of data. We can use the validator Sqoop framework: it validates the data and imports it into HDFS.
  2. Suppose tomorrow I receive new records on top of the data above. How can I import only those new records into the existing directory, and validate them, using the validator Sqoop framework?
  3. In other words: how do I use the Sqoop validator when new records arrive?
  4. I need the Sqoop validator framework to be used when newly arrived records are imported into HDFS.

Please help me, team. Thanks.

Thank You, Sipra

I’ve added the validation part in my answer. Please upvote and mark my answer as correct if it is helpful and works for you. – roh

1 Answer


My understanding is that you need to check the Oracle database for new records before you start your delta process. I don’t think you can validate based on the size of the records, but if you have an offset or a timestamp (TS) column, that will be helpful for validation.
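For instance, the last-seen offset check described above could be sketched in shell like this. This is only a hedged sketch: the state file path, the `offset_col` column, and the connection details are all assumptions, and the real max value would come from the database rather than a hard-coded stand-in.

```shell
#!/bin/sh
# Hedged sketch: detect new rows by comparing a stored last-value with the
# current max of an offset/timestamp column. All names here are assumptions.
STATE_FILE=/tmp/last_offset.txt                # hypothetical state file
LAST_VALUE=$(cat "$STATE_FILE" 2>/dev/null || echo 0)

# In a real job this value would come from the database, e.g. via sqoop eval:
#   sqoop eval --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
#     --username scott -P \
#     --query "SELECT MAX(offset_col) FROM my_table"
NEW_MAX=42                                     # stand-in value for illustration

if [ "$NEW_MAX" -gt "$LAST_VALUE" ]; then
  echo "new records since last run (last=$LAST_VALUE, max=$NEW_MAX)"
  echo "$NEW_MAX" > "$STATE_FILE"              # remember for the next run
else
  echo "no new records"
fi
```

The stored last-value then doubles as the `--last-value` input for the incremental import below.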

How do I know if there are new records in Oracle since the last run/job/check?

You can do this with two sqoop import approaches; examples and explanations of both follow.

sqoop incremental

Following is an example of a Sqoop incremental import:

sqoop import \
  --connect jdbc:mysql://localhost:3306/ydb \
  --table yloc \
  --username root -P \
  --check-column rDate \
  --incremental lastmodified \
  --last-value 2014-01-25 \
  --target-dir yloc/loc

This is explained here: https://www.tutorialspoint.com/sqoop/sqoop_import.html
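One detail worth knowing: if you wrap the incremental import in a saved Sqoop job, Sqoop stores the last-value in its metastore after each run, so you don’t have to track it yourself. A hedged sketch follows; the connection string, table, and column names are assumptions, and `lastmodified` mode needs `--merge-key` (or append mode) to write into an existing directory.

```shell
# Hedged sketch: a saved sqoop job that tracks --last-value automatically.
# Connection string, table, and column names are assumptions.
sqoop job --create daily_delta -- import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table YLOC \
  --check-column RDATE \
  --incremental lastmodified \
  --merge-key ID \
  --target-dir /user/hive/yloc

# Each subsequent run picks up from the stored last-value:
sqoop job --exec daily_delta
```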

sqoop import using query option

Here you basically use a where condition in the query and pull only the data whose date or offset column is greater than the last received value.

Here is the syntax for it:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba --password cloudera \
  --query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
  --split-by salary \
  --target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new

Isolate the validation and import job

If you want to run the validation and import jobs independently, Sqoop has another utility for that: sqoop eval. With it you can run a query against the RDBMS and direct the output to a file or to a variable in your code, then use that for validation however you want.

Syntax:

$ sqoop eval \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --query "SELECT * FROM employee LIMIT 3"

Explained here : https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
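As a concrete illustration of using sqoop eval output for validation, the source count could be compared with the row count of the files in HDFS. This is a hedged sketch only: the connection details, table name, HDFS path, and the crude extraction of the count from sqoop eval’s tabular output are all assumptions.

```shell
#!/bin/sh
# Hedged sketch: compare a source row count (via sqoop eval) with the
# row count of the imported files in HDFS. All names are assumptions.
SRC_COUNT=$(sqoop eval \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --query "SELECT COUNT(*) FROM MY_TABLE" \
  | grep -o '[0-9]\+' | tail -1)        # crude extraction from tabular output

HDFS_COUNT=$(hdfs dfs -cat /user/hive/yloc/part-* | wc -l)

if [ "$SRC_COUNT" -eq "$HDFS_COUNT" ]; then
  echo "counts match: $SRC_COUNT"
else
  echo "mismatch: source=$SRC_COUNT hdfs=$HDFS_COUNT"
fi
```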

validation parameter in sqoop

You can use this parameter to validate that the row counts match between what was imported/exported on the RDBMS side and on the HDFS side.

--validate

More on that : https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation
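Putting it together, here is a hedged example of an import that validates row counts as part of the same command; the connection string and table name are assumptions. Note that, per the Sqoop docs, validation works for single-table imports/exports and is not supported with free-form `--query` imports.

```shell
# Hedged sketch: --validate compares source and target row counts
# after the transfer and fails the job on a mismatch.
# Connection string and table name are assumptions.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table YLOC \
  --target-dir /user/hive/yloc \
  --validate
```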