Even if the second option looks more efficient, there are some cases where it is not possible. One case I had to face is when you need to create an external table in Hive: since Sqoop does not support creating Hive external tables, a workaround was necessary.
Off Topic - Internal vs External Tables in Hive
External table in HIVE (stores data on HDFS)
What are external tables
- An external table stores its files on HDFS, but the table is not fully tied to the underlying files: Hive keeps only a read-only link to them.
- If you delete an external table, the file still remains on HDFS; only the table definition is removed.
- As an example, if you create an external table called “my_test” in HIVE using HiveQL and link the table to the file “flat_file.txt”, then deleting “my_test” from HIVE will not delete “flat_file.txt” from HDFS (see the HiveQL sketch after this list).
- External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.
- Metadata is maintained on the master node (in the Hive metastore), and deleting an external table from HIVE only deletes the metadata, not the data/file.
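To make the “my_test” example concrete, here is a minimal HiveQL sketch (the single-column layout and the directory path are assumptions for illustration; note that an external table's LOCATION points at a directory containing the file, not at the file itself):

CREATE EXTERNAL TABLE my_test (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///user/my_user/flat_file_dir/';   -- directory holding flat_file.txt

DROP TABLE my_test;   -- removes only the metastore entry; flat_file.txt stays on HDFS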
Use an external table if you:
- Want to manage the data outside HIVE, so the data files are read and processed by an existing program that doesn't lock the files, e.g. you are planning to use an ETL tool to load/merge data files.
- Want to load the latest information into the table but still want to retain the old dataset in a file on HDFS for regulatory/legal purposes.
- Are not planning to create a table from another table's schema, e.g. CREATE TABLE table1 AS (SELECT * FROM table2).
- Need the data to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set, or if you are iterating through various possible schemas (see the sketch after this list).
- Want to use a custom location such as ASV.
- Don't want Hive to own the data and control settings, dirs, etc.; you have another program or process that will do those things.
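As a sketch of the "multiple schemas over a single data set" case (the table names, columns, and path below are hypothetical):

CREATE EXTERNAL TABLE events_raw (line STRING)
LOCATION 'hdfs:///data/events/';

CREATE EXTERNAL TABLE events_split (ts STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///data/events/';

-- Both tables read the same files; dropping either one leaves
-- /data/events/ intact for the other.
DROP TABLE events_raw;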
Internal table in HIVE (stores data on HDFS, but in a kind of restricted area)
What are internal tables
- Internal table file security is controlled solely via HIVE, e.g. tbl_batting can only be accessed via the HDFS internal account (r = read, w = write, x = execute).
- Deleting the table deletes the metadata and the data from the master node and HDFS respectively.
- Security needs to be managed within HIVE, probably at the schema level (this varies from organisation to organisation); HDFS security is out of scope in this case.
Use an internal table if you:
- Want HIVE to manage the full lifecycle of the table and its data, so dropping the table also deletes the data files.
- Are working with temporary or staging data that does not need to outlive the table.
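A matching sketch for the managed (internal) case, with a hypothetical table name:

CREATE TABLE my_managed (line STRING);   -- data files live under Hive's warehouse directory
DROP TABLE my_managed;                   -- deletes BOTH the metadata and the data files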
SOLUTION (With a Workaround to create an External Table in Hive with Sqoop)
It is necessary to import a sample of the table from Oracle to create an Avro file, which will be used to generate the Hive table. Then you import the whole table from Oracle to HDFS.
1. Generate the Avro file:

sqoop import --connect jdbc:oracle:thin:@//url:port/database \
--username my_user --password my_password -m 1 --fetch-size 10 \
--table table_to_import \
--columns columns_to_import \
--create-hive-table --hive-table dummy_table \
--as-avrodatafile --target-dir hdfs:///user/my_user/dummytab/
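With -m 1 the sample lands in a single part file, which is what step 2 will read (the part-m-00000.avro name follows Sqoop's usual output naming). A quick listing confirms it is there:

hdfs dfs -ls hdfs:///user/my_user/dummytab/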
2. Generate the schema locally:

avro-tools getschema hdfs://urlhadoop:port/user/my_user/dummytab/part-m-00000.avro > schematab.avsc
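The resulting .avsc file is a plain JSON Avro schema. As an illustration only (the field names and types below are invented; yours will mirror the imported Oracle columns), it looks roughly like this:

{
  "type" : "record",
  "name" : "table_to_import",
  "fields" : [
    { "name" : "ID", "type" : [ "null", "long" ], "default" : null },
    { "name" : "NAME", "type" : [ "null", "string" ], "default" : null }
  ]
}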
3. Move the schema back to HDFS:

hdfs dfs -copyFromLocal schematab.avsc /user/my_user/schemas
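If the target directory does not exist yet, create it first (same path as assumed above):

hdfs dfs -mkdir -p /user/my_user/schemas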
4 Create the table in Hive
hive -e "CREATE EXTERNAL TABLE MyHiveTable ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION
'hdfs:///user/my_user/tables/' TBLPROPERTIES
('avro.schema.url'='hdfs:///user/my_user/schemas/schematab.avsc');"
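No column list is needed here: the Avro SerDe derives the columns from the schema file referenced by avro.schema.url. To verify that the columns came through and that the table really is external, you can describe it:

hive -e "DESCRIBE FORMATTED MyHiveTable;"   # Table Type should read EXTERNAL_TABLE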
5. Import all data:

sqoop import --connect jdbc:oracle:thin:@//url:port/database \
--username my_user --password my_password -m 1 \
--table table_to_import \
--columns columns_to_import \
--hive-table MyHiveTable \
--as-avrodatafile --target-dir hdfs:///user/my_user/tables/ --append
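Because the external table simply reads whatever Avro files sit in its LOCATION, the appended files are queryable immediately. A row count compared against the Oracle table is a reasonable sanity check:

hive -e "SELECT COUNT(*) FROM MyHiveTable;"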
Source: http://wpcertification.blogspot.com/2015/05/importing-data-from-sqoop-into-hive.html
--hive-import does make things like importing as a Parquet table a little bit more complex, but once you get past that, why would you do something in two steps if you can do it in one... Note that when using --hive-import you need to specify additional options such as --hive-database and --hive-table. – spijs