2
votes

Apache Pig - How to read data from CSV file with data optionally enclosed within double quotes?

Sample data is provided below:

"Traditional",0.03,"Department, of Housing and Urban Development (HUD)",0.01 

Expected output:

Traditional  0.03  Department, of Housing and Urban Development (HUD)  0.01

The example has 4 columns. Two are enclosed in double quotes; the other two are unquoted floating-point values. The 3rd column also contains a comma inside the quoted data.

Please point me to Pig APIs (with sample code) that split this data correctly, so that I can process the fields using positional notation such as $0, $1, $2, $3.

I have tried CSVExcelStorage and CSVLoader from PiggyBank, but I could not get the fields to split correctly.

2 Answers

1
votes
a = LOAD 'filename.csv' USING PigStorage (',') AS (fieldname:chararray, fieldname2:float);

DUMP a;
1
votes

Option 1 – Using CSVLoader or CSVExcelStorage

 REGISTER piggybank.jar;
 DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

 -- CSVLoader keeps commas that appear inside double-quoted fields.
 a = LOAD 'data' USING CSVLoader() AS (field1:chararray, field2:double,
                                       field3:chararray, field4:double);

 b = FOREACH a GENERATE $0,$1,$2,$3;

 DUMP b;
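
The heading also mentions CSVExcelStorage, so here is a minimal sketch of the same load with that class. The alias `ExcelCSV` and the path `'data'` are placeholders of mine, and I am assuming `piggybank.jar` is reachable from wherever the script runs.

```pig
-- Assumption: piggybank.jar is in the working directory (adjust the path otherwise).
REGISTER piggybank.jar;

-- ExcelCSV is a hypothetical alias; the constructor argument is the field delimiter.
DEFINE ExcelCSV org.apache.pig.piggybank.storage.CSVExcelStorage(',');

-- Quoted values are read as single fields with the enclosing quotes removed.
a = LOAD 'data' USING ExcelCSV() AS (field1:chararray, field2:double,
                                     field3:chararray, field4:double);

-- Positional access, as asked for in the question.
b = FOREACH a GENERATE $0, $1, $2, $3;

DUMP b;
```

With either loader, the third column of the sample row should arrive as one field, comma included.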

Option 2 – TextLoader + STRSPLIT + REPLACE

 A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);

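 -- Name the result so the next statement can still refer to it as line.
 B = FOREACH A GENERATE REPLACE(line,'"','') AS line;

 -- Note: this splits on every comma, including one inside a quoted value.
 C = FOREACH B GENERATE FLATTEN(STRSPLIT(line, ','));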

 DUMP C;
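
A plain split on ',' would also cut the sample's third column at its embedded comma. One possible variation, sketched below, splits only on commas that fall outside double quotes and strips the quotes afterwards. The regular expression and the column names `c0` to `c3` are my own choices, and I am assuming every row has exactly four columns with balanced quotes.

```pig
-- '/path/to/files/' is the placeholder path from the answer above.
A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);

-- Split only on commas followed by an even number of double quotes,
-- i.e. commas that are not inside a quoted value.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ',(?=(?:[^"]*"[^"]*")*[^"]*$)', 4))
        AS (c0:chararray, c1:chararray, c2:chararray, c3:chararray);

-- Remove the enclosing quotes and cast the numeric columns.
C = FOREACH B GENERATE REPLACE(c0, '"', '') AS c0, (double)c1 AS c1,
                       REPLACE(c2, '"', '') AS c2, (double)c3 AS c3;

DUMP C;
```

This does not handle escaped quotes or quoted values that span several lines; for those cases the PiggyBank loaders in Option 1 are likely the safer choice.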