0
votes

I am receiving csv formatted files (fields are comma delimited and double-quoted) into HDFS and have developed a pig script which removes the header rows and strips the double quotes before I insert the data into Hive using an HQL script.

This process has been working fine; however, today I discovered a data issue with one of the tables. The files for this table in particular have a string field which can contain multiple commas within the double quotes. This is resulting in the data being incorrectly loaded into the wrong columns in Hive for some of the records.

I cannot change the format of the files at the source.

Currently I am using the PiggyBank CSVExcelStorage to process the csv formatting as follows. Can this be modified to produce the correct result? What other options do I have? I noticed that there is also a CSVLoader now but havent found any examples showing how to use/implement it. Pig CSVLoader

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')

Editing to add additional sample data and results of testing:

Sample Input File Data:

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

Using CSVExcelLoader with formatting provided above:

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

Using CSVLoader as CSVLoader(): Notice - Didn't see any options for parameters to be provided to the constructor

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

The only real difference I see is that CSVLoader isn't removing the header row as I saw no option to select this and instead its removing some of the header names.

Am I doing something incorrectly? A working solution will be appreciated.

1
Does the data in Hive need to have those comma's in the field? One way to handle this is to replace the comma in the fields with another character say '|' and then load the data.VK_217
@inquisitive_mind Yes, I need to preserve the original formatting of the data.HendPro12

1 Answers

2
votes

To get around the issue of commas in the fields, you can try this work around.

Load the data as a line.
Treat "," as the delimiter and replace it with a pipe character i.e. '|'.
Replace the beginning and end quote " with empty string.
Load the line into hive with '|' as the delimiter.

A = LOAD 'test1.csv' AS (lines:chararray);
ranked = rank A;
B = FILTER ranked BY (rank_A > 1);
C = FOREACH B GENERATE REPLACE($1,'","','|');
D = FOREACH C GENERATE REPLACE($0,'"','');
DUMP D;

A = LOAD 'test1.csv' AS (lines:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

ranked = rank A;

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

B = FILTER ranked BY (rank_A > 1);

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

C = FOREACH B GENERATE REPLACE($1,'","','|');

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q")

D = FOREACH C GENERATE REPLACE($0,'"','');

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q)

You can now load this data to hive with '|' as the delimiter.

enter image description here