3 votes

I need to create a Hive table on a Unicode-delimited file (the delimiter is the Unicode replacement character, "\uFFFD").

To do this we are submitting Hive jobs to the cluster. First I tried LazySimpleSerDe using ROW FORMAT DELIMITED:

gcloud dataproc jobs submit hive --cluster --region --execute "
CREATE EXTERNAL TABLE hiveuni_test_01 (
  codes string, telephone_num string, finding_name string, given_name string,
  alt_finding_name string, house_num string, street_name string, locality string,
  state string, reserved string, zip_code string, directive_text string,
  special_listing_text string, id string, latitude string, longitude string,
  rboc_sent_date string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\uFFFD'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"

But this does not create the table correctly; the entire row is put into the first column only.
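This looks consistent with LazySimpleSerDe only supporting single-character (single-byte) field delimiters: U+FFFD encodes to three bytes in UTF-8 (ef bf bd), so the declared delimiter never matches and every row lands in the first column. A quick diagnostic sketch, assuming bash 4.2+ (for \u escapes) and that the LOCATION directory holds plain UTF-8 text files:

# UTF-8 bytes of the replacement character; expect: ef bf bd
printf '\uFFFD' | od -An -tx1

# Compare against the bytes that actually separate fields in the data
gsutil cat 'gs://hive-idaas-dev-warehouse/datasets/unicode_file/*' | head -c 200 | od -An -tx1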

We are using a Cloud SQL MySQL server as the Hive metastore, and I checked that MySQL has utf8 encoding as well.
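For reference, a minimal sketch of how to check the metastore database's character set from the mysql client (the database name hive_metastore below is a placeholder; substitute your actual metastore database):

mysql -e "SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = 'hive_metastore';"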

Then I tried MultiDelimitSerDe:

gcloud dataproc jobs submit hive --cluster dev-sm-35cb3516-ed82-4ec2-bf0d-89bd7e0e60f0 --region us-central1 \
  --jars gs://hive-idaas-dev-warehouse/hive-jar/hive-contrib-0.14.0.jar \
  --execute "
CREATE EXTERNAL TABLE hiveuni_test_05 (
  codes string, telephone_num string, finding_name string, given_name string,
  alt_finding_name string, house_num string, street_name string, locality string,
  state string, reserved string, zip_code string, directive_text string,
  special_listing_text string, id string, latitude string, longitude string,
  rboc_sent_date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='\uFFFD')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"

This gives an exception: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.MultiDelimitSerDe not found

I have put an initialization script in place that runs at cluster startup and places hive-contrib-0.14.0.jar, containing the class org.apache.hadoop.hive.serde2.MultiDelimitSerDe, in /usr/lib/hadoop/lib/. I verified by SSHing into the cluster that the jar is in that folder.
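One way to check which package the class actually lives in inside that jar (a diagnostic sketch, assuming a JDK on the node; unzip -l works as well):

# List the jar entries and search for the SerDe class
jar tf /usr/lib/hadoop/lib/hive-contrib-0.14.0.jar | grep MultiDelimitSerDe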

Is there a way for the Hive client to read Unicode delimiter characters while creating the table? And why do I still get a ClassNotFoundException even after placing the jar in the Hadoop lib directory?


1 Answer

2 votes

hive-contrib-0.14.0 does not have org.apache.hadoop.hive.serde2.MultiDelimitSerDe. The fully qualified class name is actually org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe; notice the extra contrib in there.

So change your query to use the correct fully qualified class name and see if that solves the issue. You probably don't have to explicitly add a hive-contrib jar; it should already be under /usr/lib/hive/lib.
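For example, here is the DDL from the question with only the SerDe class name corrected (a sketch; everything else is unchanged):

CREATE EXTERNAL TABLE hiveuni_test_05 (
  codes string, telephone_num string, finding_name string, given_name string,
  alt_finding_name string, house_num string, street_name string, locality string,
  state string, reserved string, zip_code string, directive_text string,
  special_listing_text string, id string, latitude string, longitude string,
  rboc_sent_date string)
-- note the extra "contrib" in the package name
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='\uFFFD')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';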

HIVE-20020 and HIVE-20619 were done in Hive 4.0; since you are using Dataproc, which does not have Hive 4.0 yet, they should not apply to your case.