2
votes

My requirement is to load the data from RDBMS into HDFS (backed by CDH 5.9.X) via sqoop (1.4.6) in avro format and then use an external hive(1.1) table to query the data. Unfortunately, the data in RDBMS has some new line characters.

We all know that hive can't parse new line character in the data and the data mapping fails when selected the whole data via hive. However, hive's select count(*) works fine.

I used below options during sqoop import and checked, but didn't work:

--hive-drop-import-delims
--hive-delims-replacement

The above options work for text format. But storing data in text format is not a viable option for me. The above options are converted properly in the Sqoop generated (codegen) POJO class's toString method (obviously as text format is working as expected), so I feel this method is not at all used during avro import. Probably because avro has not problem dealing with new line character, where as hive has.

I am surprised, don't anyone face such a common scenario, a table which has remark, comment field is prone to this problem.

Can anyone suggest me a solution please?

My command:

sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
--P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String\
--hive-delims-replacement ' '
--null-string '\\N'
--null-non-string '\\N'
--fields-terminated-by '\001'
-m 1
2
You are using the combination of bit hive import and HDFS import. for Hive import you shoud use command --hive-import --hive-drop-import-delims --map-column-hive - Sandeep Singh

2 Answers

1
votes

This looks like an issue with avro serde. It is an open bug. https://issues.apache.org/jira/browse/HIVE-14044.

Can you try the same in hive 2.0?

1
votes

As mentioned by VJ, there is an open issue for new line character in avro.

What an alternate approach that you can try is

  1. Sqoop the data into a hive staging table as a textfileformat.
  2. Create an avro table.
  3. Insert data from staging table to main avro table in hive.

As newline character is very well handled in textfileformat