I need to create a Hive table on a Unicode-delimited file (the delimiter is the Unicode replacement character, "\uFFFD").
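For reference, the delimiter bytes can be confirmed by hex-dumping the start of the file. This is a quick sketch, assuming the data is UTF-8 encoded (U+FFFD encodes as the three bytes EF BF BD) and that the location holds plain text files:

gsutil cat gs://hive-idaas-dev-warehouse/datasets/unicode_file/* | head -c 200 | hexdump -C

The field separators should show up as "ef bf bd" between the column values.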
To do this we are submitting Hive jobs to the cluster. First I tried LazySimpleSerDe via ROW FORMAT DELIMITED:
gcloud dataproc jobs submit hive --cluster <cluster-name> --region <region> --execute "
CREATE EXTERNAL TABLE hiveuni_test_01 (
  codes string, telephone_num string, finding_name string, given_name string,
  alt_finding_name string, house_num string, street_name string, locality string,
  state string, reserved string, zip_code string, directive_text string,
  special_listing_text string, id string, latitude string, longitude string,
  rboc_sent_date string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\uFFFD'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"
But this does not create the table correctly; the entire row is put into the first column only. I suspect this is because U+FFFD is a multi-byte character in UTF-8, while LazySimpleSerDe effectively treats the field delimiter as a single byte.
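To see what delimiter value Hive actually recorded for the table, the table metadata can be dumped the same way the jobs are submitted; a sketch, with cluster and region as placeholders:

gcloud dataproc jobs submit hive --cluster <cluster-name> --region <region> --execute "DESCRIBE FORMATTED hiveuni_test_01;"

The Storage Desc Params section of the output should show the field.delim value the table was created with.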
We are using a Cloud SQL MySQL server as the Hive metastore, and I have checked that MySQL uses utf8 encoding as well.
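The encoding check was roughly the following; a sketch, assuming gcloud sql connect access to the metastore instance (the instance name is a placeholder):

gcloud sql connect <metastore-instance> --user=root
mysql> SHOW VARIABLES LIKE 'character_set%';

Both character_set_server and character_set_database report utf8.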
I then tried MultiDelimitSerDe:
gcloud dataproc jobs submit hive --cluster dev-sm-35cb3516-ed82-4ec2-bf0d-89bd7e0e60f0 --region us-central1 \
  --jars gs://hive-idaas-dev-warehouse/hive-jar/hive-contrib-0.14.0.jar --execute "
CREATE EXTERNAL TABLE hiveuni_test_05 (
  codes string, telephone_num string, finding_name string, given_name string,
  alt_finding_name string, house_num string, street_name string, locality string,
  state string, reserved string, zip_code string, directive_text string,
  special_listing_text string, id string, latitude string, longitude string,
  rboc_sent_date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='\uFFFD')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'gs://hive-idaas-dev-warehouse/datasets/unicode_file';"
This gives an exception:

java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.MultiDelimitSerDe not found
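To rule out a packaging problem, the jar contents can be listed directly on the cluster; a sketch to run after SSH-ing to the master node:

unzip -l /usr/lib/hadoop/lib/hive-contrib-0.14.0.jar | grep -i multidelimit

This prints the fully qualified path of the MultiDelimitSerDe class actually packaged in the jar, which can be compared against the class name used in the DDL.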
I have set up an initialization action that runs at cluster startup and places hive-contrib-0.14.0.jar, which contains the class org.apache.hadoop.hive.serde2.MultiDelimitSerDe, in /usr/lib/hadoop/lib/. I have confirmed by SSH-ing to the cluster that the jar is present in that folder.
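The initialization action itself is essentially the following sketch (the jar path is the same one passed via --jars above):

#!/bin/bash
# Init action: stage the hive-contrib jar into Hadoop's lib dir on each node.
set -e
gsutil cp gs://hive-idaas-dev-warehouse/hive-jar/hive-contrib-0.14.0.jar /usr/lib/hadoop/lib/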
Is there a way for the Hive client to read Unicode delimiter characters while creating a table? And why do I still get a ClassNotFoundException even after placing the jar in the Hadoop lib directory?