I have a comma-separated values (CSV) file.

Data example:

1001,Laptop,beautify,laptop amazing price,<HTML>XYZ</HTML>,1345

1002,Camera,Best Mega Pixel,<HTML>ABC</HTML>,4567

1003,TV,Best Price,<HTML>DEF</HTML>,8791

We have only 5 columns: id, Device, Description, HTML Code, Identifier.

For a few of the records there is an extra comma in the Description column.

For example, the first record in the sample data above has an extra comma in its description [beautify,laptop amazing price], which I want to eliminate.

While loading the data into Pig:

INFILE1 = LOAD 'file1.csv' USING PigStorage(',') AS (id,Device,Description,HTML_Code,Identifier);

this creates a data issue: the extra comma splits the description in two, so the remaining fields shift one column to the right.

Could you please suggest how to handle this data issue in Pig Script?

1 Answer

If the file is a valid CSV, any field that contains a comma should be enclosed in double quotes (for example, "beautify,laptop amazing price"). In that case, you just have to load your data using CSVLoader from Piggybank: https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/CSVLoader.html.

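-- Piggybank provides CSVLoader, which honours double-quoted fields
register 'piggybank.jar';
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
-- Pig aliases cannot contain spaces, hence HTML_Code
INFILE1 = LOAD 'file1.csv' using CSVLoader() as (id,Device,Description,HTML_Code,Identifier);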

If the fields are not quoted, you could try a regex instead, relying on the fact that the HTML Code field (the fourth field, right after Description) always starts with "<". See the REGEX_EXTRACT_ALL function in Pig: https://pig.apache.org/docs/r0.11.1/func.html#regex-extract-all. Tell me if you need more info.
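
The regex approach could look roughly like the sketch below. It is untested against the real file and makes some assumptions: every line has exactly the five fields described in the question, the HTML Code field starts with "<" and ends with ">", and only Description can contain stray commas. The relation names raw, parsed and cleaned are placeholders, and HTML_Code is used because an alias cannot contain a space.

```pig
-- Read each line whole instead of splitting on commas.
raw = LOAD 'file1.csv' USING TextLoader() AS (line:chararray);

-- Groups: id, Device, Description (may contain commas), HTML code, Identifier.
-- The greedy (.*) absorbs any extra commas that precede the "<" of the HTML field.
parsed = FOREACH raw GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line, '^([^,]*),([^,]*),(.*),(<.*>),([^,]*)$')
) AS (id:chararray, Device:chararray, Description:chararray,
      HTML_Code:chararray, Identifier:chararray);

-- Optional: replace the stray commas inside Description with a space.
cleaned = FOREACH parsed GENERATE
    id, Device, REPLACE(Description, ',', ' ') AS Description,
    HTML_Code, Identifier;
```

Lines that do not match the pattern come back with null fields, so it is worth checking for nulls before relying on the result.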