1
votes

I am new to Apache Pig and am trying to load test Twitter data to count the number of tweets per user name. My data has the following format:

(twitterId,comment,userRefId)

Sample Data

When I load the data into Pig using PigStorage(','), the comment field is also split into multiple fields, because comments can contain ',' as well. Please let me know how to load this data properly in Pig. I am using the command below:

data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
2
Is it an option to modify the source data to use a different separator? - darkownage
@darkownage: yes - Vinita Gupta

2 Answers

0
votes

Load each record as a single line, then replace ," with | and ", with |. This relies on the comment field being wrapped in double quotes; after the replacement the three fields are separated by |, so STRSPLIT can extract them.

A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;

EDIT: I ran the script against sample text and it works fine.
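
STRSPLIT returns a single tuple, so the three values are not yet named fields. As a minimal sketch (assuming the comment is always double-quoted and never contains | itself, and with the alias names chosen here purely for illustration), FLATTEN can turn that tuple into the id, comment and refId columns from the question:

```pig
-- Read each record as one raw line
A = LOAD 'data.txt' AS (line:chararray);

-- Swap the delimiters around the quoted comment for a pipe
B = FOREACH A GENERATE REPLACE(REPLACE(line, ',"', '|'), '",', '|') AS line;

-- Split on the pipe and flatten the resulting tuple into named fields
C = FOREACH B GENERATE
        FLATTEN(STRSPLIT(line, '\\|', 3))
        AS (id:chararray, comment:chararray, refId:chararray);

DESCRIBE C;
DUMP C;
```

After this step C has the same schema as the relation the question tried to load with PigStorage(','), so later statements can refer to the fields by name.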

0
votes

If changing the separator in your source data is an option, I would go that route. It will probably make it a lot easier to get started and to track down issues.

If you change your separator to |, your code could look like this:

data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);
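
Once the fields load cleanly, the count the question asks for could be sketched as below. This assumes refId is the field that identifies the user (the question does not say which column holds the user name), and by_user and tweet_counts are illustrative alias names:

```pig
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' USING PigStorage('|')
       AS (id:chararray, comment:chararray, refId:chararray);

-- Collect all tweets that share the same user reference
by_user = GROUP data BY refId;

-- Emit one row per user with the number of tweets in its group
tweet_counts = FOREACH by_user GENERATE group AS refId, COUNT(data) AS tweets;

DUMP tweet_counts;
```

Note that COUNT skips tuples whose first field is null, so rows with a missing id would not be counted; COUNT_STAR counts them as well.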