Recently, I was working on importing data from Oracle to HDFS using Sqoop Import.
Here's one of lessons learned:
I started observe that Sqoop runs boundary query slowly, e.g dataset imported in 5 hours, boundary query took 1 hour.
I looked to basic query and it's SELECT MIN(id), max(id) from user
When I ran this query on huge table(billions records), it stuck and Oracle didn't respond quickly. Whereas, SELECT MIN(id) from user
OR SELECT max(id) from user
runs fast.
I had no details of Oracle table structure, index information etc.
In this post I want to share how I solved it in my case and want to ask community if anyone solved the same differently.
Another question, why Sqoop developers didn't consider performance of boundary query.
I'm not sure if Sqoop is still frequently used, vote this question if it's still relevant and I will make a post of all lessons learned.