1 vote

I am using Sqoop to import MySQL tables to HDFS. To do that, I use a free-form query import.

--query "SELECT $query_select FROM $table where \$CONDITIONS"

This query is quite slow because of the min(id) and max(id) search. To improve performance, I've decided to use --boundary-query and specify the lower and upper bounds manually (see https://www.safaribooksonline.com/library/view/apache-sqoop-cookbook/9781449364618/ch04.html):

--boundary-query "select 176862848, 172862848"

However, Sqoop ignores the specified values and again tries to find the minimum and maximum "id" by itself.

16/06/13 14:24:44 INFO tool.ImportTool: Lower bound value: 170581647
16/06/13 14:24:44 INFO tool.ImportTool: Upper bound value: 172909234
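
For context, when no boundary query is applied, Sqoop computes these bounds itself with a min/max query over the split column. For a free-form import, the generated query looks roughly like this (a sketch assuming id is the split column; the exact SQL varies by Sqoop version):

SELECT MIN(id), MAX(id) FROM (SELECT $query_select FROM $table WHERE (1 = 1)) AS t1

That full scan over the table is what makes the bound computation slow in the first place.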

The complete Sqoop command:

sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number \
    --connect jdbc:mysql://$server:$port/$database --username $username --password $password \
    --target-dir $destination_dir --boundary-query "select 176862848, 172862848" \
    --incremental append --check-column $id_column_name --last-value $last_value \
    --split-by $id_column_name --query "SELECT $query_select FROM $table where \$CONDITIONS" \
    --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'

Has anyone already met/solved this problem? Thanks!

Comment (score 3): Do not answer your own question as an edit; post it as an answer instead. You can then accept that. - Rob

3 Answers

1 vote

I've managed to solve this problem by deleting the following arguments:

--incremental append --check-column $id_column_name --last-value $last_value

It seems there is a conflict between the --boundary-query, --check-column, --split-by and --incremental append arguments.
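
For reference, here is a sketch of the command once those arguments are removed (otherwise identical to the one in the question):

sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number \
    --connect jdbc:mysql://$server:$port/$database --username $username --password $password \
    --target-dir $destination_dir --boundary-query "select 176862848, 172862848" \
    --split-by $id_column_name --query "SELECT $query_select FROM $table where \$CONDITIONS" \
    --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'

After this change, Sqoop used the bounds from --boundary-query instead of computing its own.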

0 votes

You are correct.

We should not use --split-by together with the --boundary-query control argument.

0 votes

Try it like this:

--boundary-query "select 176862848, 172862848 from tablename limit 1" \