7
votes

We are using Cloudera CDH 4 and we are able to import tables from our Oracle databases into our HDFS warehouse as expected. The problem is we have 10's of thousands of tables inside our databases and sqoop only supports importing one table at a time.

What options are available for importing multiple tables into HDFS or Hive? For example what would be the best way of importing 200 tables from oracle into HDFS or Hive at a time?

The only solution i have seen so far is to create a sqoop job for each table import and then run them all individually. Since Hadoop is designed to work with large dataset it seems like there should be a better way though.

7

7 Answers

4
votes
  1. Assuming that the sqoop configuration for each table is the same, you can list all the tables you need to import and then iterate over them launching sqoop jobs (ideally launch them asynchronously). You can run the following to fetch the list of tables from Oracle: SELECT owner, table_name FROM dba_tables reference

  2. Sqoop does offer an option to import all tables. Check this link. There are some limitations though.

  3. Modify sqoop source code and recompile it to your needs. The sqoop codebase is well documented and nicely arranged.

5
votes

U can use " import-all-tables " option to load all tables into HDFS at one time .

sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop  --target-dir '/Sqoop21/AllTables'

if we want to exclude some tables to load into hdfs we can use " --exclude-tables " option

Ex:

sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop  --target-dir '/Sqoop21/AllTables'  --exclude-tables <table1>,<tables2>

If we want to store in a specified directory then u can use " --warehouse-dir " option

Ex:

sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop --warehouse-dir '/Sqoop'
3
votes

--target-dir is not a valid option when using import-all-tables.

To import all tables in particular directory, Use --warehouse-dir instead of --target-dir.

Example:

$ sqoop import-all-tables --connect jdbc:mysql://localhost/movies --username root --password xxxxx --warehouse-dir '/user/cloudera/sqoop/allMoviesTables' -m 1

2
votes

The best option is do my shell script

Prepare a inputfile which has list of DBNAME.TABLENAME 2)The shell script will have this file as input, iterate line by line and execute sqoop statement for each line.

while read line;
do

    DBNAME=`echo $line | cut -d'.' -f1` 
    tableName=`echo $line | cut -d'.' -f2`


    sqoop import -Dmapreduce.job.queuename=$QUEUE_NAME --connect '$JDBC_URL;databaseName=$DBNAME;username=$USERNAME;password=$PASSWORD' --table $tableName  --target-dir $DATA_COLLECTOR/$tableName  --fields-terminated-by '\001'  -m 1 

done<inputFile
0
votes

You can use Sqoop "import-all-tables" feature to import all the tables in the database. This also has another parameter, --exclude-tables, along with which you can exclude some of the table that you don't want to import in the database.

Note: --exclude-tables only works with import-all-tables command.

0
votes

importing multiple tables by sqoop if no of tables are very less. Create sqoop import for each table as below .

  • sqoop import --connect jdbc:mysql://localhost/XXXX --username XXXX
    password=XXXX
    --table XXTABLE_1XX*
  • sqoop import --connect jdbc:mysql://localhost/XXXX --username XXXX
    password=XXXX
    --table XXTABLE_2XX*

and so on.

But what if no of tables are 100 or 1000 or even more. Below would be ideal solution.

In such scenario, preparing shell script which takes input from text file containing list of table names to be imported, iterate over, run the scoop import job for each table