2
votes

I need the filename attached to each row, so I used:

data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray);

However, some columns in data.csv contain commas (,) in their content, so to handle that I used:

data = LOAD 'data.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (filename:chararray);

But I couldn't find any way to use the -tagFile option with CSVExcelStorage. How can I use CSVExcelStorage and the -tagFile option together?

Thanks


2 Answers

2
votes

I found a way to do both operations: get the filename in each row, and remove the delimiter when it appears inside column content.

data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);

/* remove any comma (,) that appears inside quoted column content */
replaceComma = FOREACH data GENERATE filename, REPLACE (record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');

/* remove the quotes (") that CSV places around a column when it contains a comma (,) */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE ($1,'"','') as record;

Once the data is loaded without the embedded commas, I am free to perform any operation on it. A detailed use case is available on my blog.
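
To see what the two REPLACE steps above do to a single record, here is a minimal sketch in Python rather than Pig. It assumes the same pattern behaves identically in Python's `re` module for this kind of input (Pig's REPLACE is, as far as I know, backed by Java regular expressions), and the name `clean_record` and the sample row are made up for illustration. The pattern matches a comma only when it is followed by an odd number of double quotes, i.e. when the comma sits inside a quoted field.

```python
import re

# A comma that is NOT followed by an even number of double quotes up to the
# end of the line, which means the comma is inside a quoted field.
QUOTED_COMMA = re.compile(r',(?!(?:(?:[^"]*"){2})*[^"]*$)')


def clean_record(record: str) -> str:
    """Drop commas inside quoted fields, then drop the quotes themselves."""
    without_commas = QUOTED_COMMA.sub('', record)
    return without_commas.replace('"', '')


if __name__ == "__main__":
    # Hypothetical sample row; the comma inside "Smith, John" is removed,
    # while the two commas that separate columns are kept.
    print(clean_record('1,"Smith, John",NY'))
```

Note that this approach deletes the embedded commas from the data, so it only fits cases where losing them is acceptable.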

0
votes

You can't use -tagFile with CSVExcelStorage, since CSVExcelStorage does not have a -tagFile option. The workaround is either to change the delimiter of the file and use PigStorage with the new delimiter and -tagFile, or to replace the commas in your data.
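
One possible way to change the delimiter before loading is a small pre-processing script outside Pig. The sketch below uses Python's standard `csv` module to parse the quoted fields and rewrite each row with a pipe delimiter; the function name `convert_delimiter`, the choice of `|`, and the sample rows are assumptions for illustration only. It refuses to write a row whose content already contains the new delimiter, because that would reintroduce the original problem.

```python
import csv
import io


def convert_delimiter(src, dst, new_delimiter="|"):
    """Read comma-separated rows from src and write them to dst
    joined by new_delimiter, without quotes."""
    for row in csv.reader(src):
        # Guard: the new delimiter must not occur inside any field.
        if any(new_delimiter in field for field in row):
            raise ValueError(f"field contains {new_delimiter!r}: {row}")
        dst.write(new_delimiter.join(row) + "\n")


if __name__ == "__main__":
    # Hypothetical input; in practice src and dst would be open files.
    source = io.StringIO('1,"Smith, John",NY\n2,"Doe, Jane",CA\n')
    target = io.StringIO()
    convert_delimiter(source, target)
    print(target.getvalue(), end="")
```

The converted file could then be loaded with PigStorage('|','-tagFile'), keeping the commas inside the column content intact.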