Loading data into a Hive table with multiple charsets

Question

I am facing an issue where I have multiple files with different charsets, say one file uses a Chinese charset and another uses a French charset. How can I load them into a single Hive table? I searched online and found this:

ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');

With this I can handle the charset of only one of the files, either Chinese or French. Is there a way to handle both charsets at once?

[UPDATE]

Okay, I am using RegexSerDe for a fixed-width file, and the encoding scheme in use is ISO 8859-1. It seems RegexSerDe is not taking this encoding into account and is splitting the characters as if they were in the default UTF-8 encoding. Is there a way to make RegexSerDe take the encoding scheme into account?

hlagos · Accepted Answer · 2017-01-26T15:03:59

I am not sure this is possible (I think it isn't, based on https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/AbstractEncodingAwareSerDe.java, since the encoding is a single property set per table). A workaround could be to create two tables with different encodings and create a view on top of them.
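
A minimal HiveQL sketch of that workaround follows. The table names, column names, HDFS locations, tab delimiter, and the 'GBK' and 'ISO-8859-1' charset names are all assumptions for illustration, so substitute whatever your files actually use. It uses LazySimpleSerDe with delimited text rather than the RegexSerDe from the question, and relies on a Hive version where 'serialization.encoding' is supported.

```sql
-- Hypothetical table for the Chinese-encoded files (charset assumed to be GBK).
CREATE EXTERNAL TABLE mytable_zh (id STRING, body STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = '\t',
  'serialization.encoding' = 'GBK'
)
STORED AS TEXTFILE
LOCATION '/data/mytable/zh';   -- assumed path holding only the Chinese files

-- Hypothetical table for the French-encoded files (charset assumed to be ISO-8859-1).
CREATE EXTERNAL TABLE mytable_fr (id STRING, body STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = '\t',
  'serialization.encoding' = 'ISO-8859-1'
)
STORED AS TEXTFILE
LOCATION '/data/mytable/fr';   -- assumed path holding only the French files

-- One view over both tables, so queries see a single logical table.
CREATE VIEW mytable_all AS
SELECT id, body FROM mytable_zh
UNION ALL
SELECT id, body FROM mytable_fr;
```

Queries would then go against mytable_all, for example SELECT * FROM mytable_all LIMIT 10; each underlying table decodes its own files, so the files have to be kept in separate directories by charset.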
