Sqoop Import of Data having new line character in avro format and then query using hive

Question

My requirement is to load the data from RDBMS into HDFS (backed by CDH 5.9.X) via sqoop (1.4.6) in avro format and then use an external hive(1.1) table to query the data. Unfortunately, the data in RDBMS has some new line characters.

We all know that hive can't parse new line character in the data and the data mapping fails when selected the whole data via hive. However, hive's select count(*) works fine.

I used below options during sqoop import and checked, but didn't work:

--hive-drop-import-delims
--hive-delims-replacement

The above options work for text format. But storing data in text format is not a viable option for me. The above options are converted properly in the Sqoop generated (codegen) POJO class's toString method (obviously as text format is working as expected), so I feel this method is not at all used during avro import. Probably because avro has not problem dealing with new line character, where as hive has.

I am surprised, don't anyone face such a common scenario, a table which has remark, comment field is prone to this problem.

Can anyone suggest me a solution please?

My command:

sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
--P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String\
--hive-delims-replacement ' '
--null-string '\\N'
--null-non-string '\\N'
--fields-terminated-by '\001'
-m 1

You are using the combination of bit hive import and HDFS import. for Hive import you shoud use command --hive-import --hive-drop-import-delims --map-column-hive — Sandeep Singh

VJ M VJ M · Accepted Answer · 2018-04-06T10:24:31

This looks like an issue with avro serde. It is an open bug. https://issues.apache.org/jira/browse/HIVE-14044.

Can you try the same in hive 2.0?

Sqoop Import of Data having new line character in avro format and then query using hive

2 Answers