0
votes

I am getting data in below format :

100|15|N-PS-GL-PSJOB|1,A|JFGLFX48|"AAAA"|102

100|15|N-PS-GL-PSJOB|2,A|JFGLFX48|"AAEE"|102

100|15|N-PS-GL-PSJOB|1,A|JFGLFX48|"AXXX"|102

100|15|N-PS-GL-PSJOB|2,A|JFGLFX48|"ABCH"|102

I need to implement

  1. parse with "|" and split fourth column value using ','
  2. remove quotes

I used array as datatype for 4th column and opencsvserede to remove quotes.But how can I pass split method(COLLECTION ITEMS TERMINATED BY ',') inside opencsv serede.

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES("separatorChar" = "|","quoteChar" = "\"")

Thanks Surya

1

1 Answers

0
votes

There could be couple of ways to solve this. The best option will be to pre-process this data in a hive landing table and then load the resultant data to a destination hive table that may be ORC/PARQUET file type that yields you better performance than plain text read via serde.

Create Landing table pointing to your text data

create external table landing (c1 string, c2 string, c3 string, c4 string, c5 string, c6 string, c7 string) row format delimited fields terminated by '|' location '/user/hdfs/landing';

Create destination table

create external table destination(c1 string,c2 string,c3 string,c4 string,c5 string,c6 string ,c7 string,c8 string) stored as orc location '/user/hdfs/destination';

Insert data to destination selecting records from landing

insert into destination select c1, c2, c3, split(c4,',')[0] , split(c4,',')[1] ,c5, regexp_replace(c6,'\"','') , c7 from landing;

You can re-use the collection logic if the 4th column is supposed to be a collection. For sake of simplicity, I split them to 2 separate string columns.

If you would just want to create a view on the first table, that should also be fine. But a view is still going to execute on text data which is relatively slow than a ORC/Parquet format.

Results

select * from landing

100 15  N-PS-GL-PSJOB   1,A JFGLFX48    "AAAA"  102
100 15  N-PS-GL-PSJOB   2,A JFGLFX48    "AAEE"  102
100 15  N-PS-GL-PSJOB   1,A JFGLFX48    "AXXX"  102
100 15  N-PS-GL-PSJOB   2,A JFGLFX48    "ABCH"  102

select * from destination

100 15  N-PS-GL-PSJOB   1   A   JFGLFX48    AAAA    102
100 15  N-PS-GL-PSJOB   2   A   JFGLFX48    AAEE    102
100 15  N-PS-GL-PSJOB   1   A   JFGLFX48    AXXX    102
100 15  N-PS-GL-PSJOB   2   A   JFGLFX48    ABCH    102

Other options being :-

select c1, c2, c3, split(c4,',')[0] , split(c4,',')[1] ,c5, regexp_replace(c6,'\"','') , c7 from t1;

Or if you want to hide the complexity of the replace and split from the query, just create a view with the same query

create view landingview as select c1, c2, c3, split(c4,',')[0] , split(c4,',')[1] ,c5, regexp_replace(c6,'\"','') , c7 from landing;

Both of which yields the same result

100 15  N-PS-GL-PSJOB   1   A   JFGLFX48    AAAA    102
100 15  N-PS-GL-PSJOB   2   A   JFGLFX48    AAEE    102
100 15  N-PS-GL-PSJOB   1   A   JFGLFX48    AXXX    102
100 15  N-PS-GL-PSJOB   2   A   JFGLFX48    ABCH    102