I am receiving csv formatted files (fields are comma delimited and double-quoted) into HDFS and have developed a pig script which removes the header rows and strips the double quotes before I insert the data into Hive using an HQL script.
This process has been working fine; however, today I discovered a data issue with one of the tables. The files for this table in particular have a string field which can contain multiple commas within the double quotes. This is resulting in the data being incorrectly loaded into the wrong columns in Hive for some of the records.
I cannot change the format of the files at the source.
Currently I am using the PiggyBank CSVExcelStorage to process the csv formatting as follows. Can this be modified to produce the correct result? What other options do I have? I noticed that there is also a CSVLoader now but havent found any examples showing how to use/implement it. Pig CSVLoader
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')
Editing to add additional sample data and results of testing:
Sample Input File Data:
"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"
Using CSVExcelLoader with formatting provided above:
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q
Using CSVLoader as CSVLoader(): Notice - Didn't see any options for parameters to be provided to the constructor
P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q
The only real difference I see is that CSVLoader isn't removing the header row as I saw no option to select this and instead its removing some of the header names.
Am I doing something incorrectly? A working solution will be appreciated.