0
votes

Recently, I was working on importing data from Oracle to HDFS using Sqoop Import.

Here's one of lessons learned:

I started observe that Sqoop runs boundary query slowly, e.g dataset imported in 5 hours, boundary query took 1 hour.

I looked to basic query and it's SELECT MIN(id), max(id) from user

When I ran this query on huge table(billions records), it stuck and Oracle didn't respond quickly. Whereas, SELECT MIN(id) from user OR SELECT max(id) from user runs fast.

I had no details of Oracle table structure, index information etc.

In this post I want to share how I solved it in my case and want to ask community if anyone solved the same differently.

Another question, why Sqoop developers didn't consider performance of boundary query.

I'm not sure if Sqoop is still frequently used, vote this question if it's still relevant and I will make a post of all lessons learned.

1

1 Answers

0
votes

Here's is optimized boundary query we are using:

SELECT
    *
FROM
    (
        SELECT
            MIN( id ) min_
        from
            user
    ) v1,
    (
        SELECT
            MAX( id ) max_
        from
            user
    ) v2;