Please help me out. This is urgent: the deadline is nearing and I have been stuck on this for two weeks with no result. I am a newbie in Pig Latin. I have a scenario where I have to filter data from a CSV file. The CSV is on HDFS and has two columns.

grunt>> fl = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt>> dump fl;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000")
("second~345643~gfdgd43","1000")

What I need to do is extract only the first word before the first '~' sign and concatenate it with the second column's value. I also need to group the concatenated results, count the number of rows for each value, and write a new CSV file as output with two columns: the first is the concatenated value and the second is the row count, i.e.

("first 2001","2")
("second 1000","3")

and so on.

I have written the code below, but it is not working. I used STRSPLIT, which splits the values of the first column of the input CSV file, but I don't know how to extract the first split value:

convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);

fil = FILTER convData BY conv != '"-1"'; --im using this to filter out the rows that has 1st column as "-1".

data = FOREACH fil GENERATE STRSPLIT($0, '~');

X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);

Y = FOREACH X GROUP BY X;

Z = FOREACH Y GENERATE COUNT(Y);

var = FOREACH Z GENERATE CONCAT(Y,',',Z);

STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');
Incidentally, I don't recommend stressing how urgent something is on help forums - you'll get speedy help here if you write your question well. Just be aware that some people downvote if they see this, since all help provided is at the leisure of volunteers :). - halfer

1 Answer

STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:

data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
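The grouping and counting steps from the question can be built on top of those two lines. The following is only a sketch and has not been run against the asker's data. It assumes a Pig version whose CONCAT accepts more than two arguments, and it assumes the double quotes in the file are literal characters that PigStorage leaves in place, which is why a REPLACE step strips them. The aliases clean, Y and Z and the field names key and cnt are illustrative.

```pig
-- Load the two-column CSV; PigStorage does not strip the surrounding quotes.
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);

-- Drop rows whose first column is "-1" (quotes included, as in the question).
fil = FILTER convData BY conv != '"-1"';

-- Assumption: the quotes are unwanted in the output, so remove them here.
clean = FOREACH fil GENERATE REPLACE(conv, '"', '') AS conv, REPLACE(clnt, '"', '') AS clnt;

-- STRSPLIT returns a tuple; carry clnt along so it is still available later.
data = FOREACH clean GENERATE STRSPLIT(conv, '~') AS a, clnt;

-- a.$0 is the text before the first '~'.
X = FOREACH data GENERATE CONCAT(a.$0, ' ', clnt) AS key;

-- Group identical keys, then count the rows in each group.
Y = GROUP X BY key;
Z = FOREACH Y GENERATE group AS key, COUNT(X) AS cnt;

STORE Z INTO '/user/hduser/output.csv' USING PigStorage(',');
```

Note that GROUP is its own statement rather than part of a FOREACH, and that COUNT is applied to the bag named after the grouped relation (X), not to Y. STORE creates a directory at the given path containing part files rather than a single CSV file, and the values are written unquoted.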