3
votes

I have been using the sqoop create-hive-table command with appropriate arguments to copy a table schema from MySQL (or any RDBMS) into Hive. Today I came across a requirement where I had to use "sqoop create-hive-table" along with --hive-partition-key to create the table schema with the partition column specified. The sqoop command I used:

sqoop create-hive-table --connect jdbc:mysql://localhost/XYZ \
  --username XYZ --password password \
  --table POC_XYZ_Partition \
  --hive-table POC_XYZ.POC_Datatype_Mapping_Sodhi \
  --hive-partition-key join_date \
  --fields-terminated-by ',' --lines-terminated-by '\n'

which errored out with the error message:

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: Partition key join_date cannot be a column to import.

I know the partition key should not be part of the create table column list, but I had completely forgotten that.

I then found a work-around and did:

sqoop create-hive-table --connect jdbc:mysql://localhost/XYZ \
  --username XYZ --password password \
  --table POC_XYZ_Partition \
  --hive-table POC_XYZ.POC_Datatype_Mapping_Sodhi \
  --hive-partition-key join_date_partition \
  --fields-terminated-by ',' --lines-terminated-by '\n'

The above created the schema, but later, when I tried to load data from MySQL into HDFS, sqoop treated join_date_partition as part of the MySQL table and did not fetch any data from MySQL.

I tried many solutions like:

sqoop import --connect jdbc:mysql://localhost/XYZ \
  --username XYZ --password password \
  --query 'select a.*, a.join_date as join_date_partition from POC_XYZ_Partition a WHERE $CONDITIONS' \
  --split-by id \
  --hcatalog-database POC_XYZ \
  --hcatalog-table POC_XYZ_Partition;

but I could not import the data because of a bug in sqoop that does not allow duplicating column names in the import.

Has anyone faced this issue and resolved it? If so, please suggest a solution.


1 Answer

2
votes

It seems that --hive-partition-key (and --hive-partition-value) only matter when loading data into a table, not when creating it. This also forces you to load into only a single partition at a time.
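For illustration, a single-partition import could look like the sketch below. The date value and the column names other than join_date (id, name) are placeholders, since the real schema isn't shown in the question; the partition column is typically left out of the selected columns (its value is supplied via --hive-partition-value instead), and --target-dir is required when using --query:

# Hypothetical single-partition import; adjust the column list, date and
# target dir to your schema.
sqoop import --connect jdbc:mysql://localhost/XYZ \
  --username XYZ --password password \
  --query 'SELECT id, name FROM POC_XYZ_Partition WHERE join_date = "2015-01-01" AND $CONDITIONS' \
  --split-by id \
  --target-dir /tmp/poc_xyz_partition_import \
  --hive-import \
  --hive-table POC_XYZ.POC_Datatype_Mapping_Sodhi \
  --hive-partition-key join_date \
  --hive-partition-value '2015-01-01' \
  --fields-terminated-by ',' --lines-terminated-by '\n'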

But you could try to do your task in several steps:

  1. Create a partitioned table, partitioned_table (see the sketch after the query below).
  2. Load the data with sqoop using --create-hive-table into a plain (non-partitioned) Hive table, plain_table.
  3. Load the data into the partitioned table with something like:

INSERT OVERWRITE TABLE partitioned_table PARTITION(partitionColumn) SELECT col1, ... colN, partitionColumn FROM plain_table;
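A minimal HiveQL sketch of steps 1 and 3 under assumed column names and types (id INT, name STRING), since the actual schema isn't given; the dynamic-partition settings are needed so the INSERT above can take the partition value from the data:

-- Step 1: the partitioned target table (column names/types are placeholders)
CREATE TABLE partitioned_table (
  id   INT,
  name STRING
)
PARTITIONED BY (join_date STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

-- Needed for INSERT ... PARTITION(col) SELECT ... with dynamic partitions
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Step 3: copy from the plain (non-partitioned) table loaded by sqoop;
-- the partition column must come last in the SELECT list
INSERT OVERWRITE TABLE partitioned_table PARTITION (join_date)
SELECT id, name, join_date FROM plain_table;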