I have several CSV files in an HDFS folder, which I load into a relation with:

source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command

When I dump it, the source relation looks like this (note that the data is text-qualified, but I will deal with that using the REPLACE function):

("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")

<.... more records ....>

("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")

<.... more records ....>

Each file has a header row that provides some information about the data set that follows it, such as the provider of the data and the date range it covers.

How can I transform the above structure into a new relation like the following?

{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}

That is, each header tuple is followed by a bag of the record tuples belonging to that header. Unfortunately, there is no common key field between the header and the detail rows, so I don't think I can use a JOIN operation.

I am quite new to Pig and Hadoop, and this is one of my first concept projects.

I hope my question is clear, and I look forward to some guidance.

1 Answer

This should get you started.
Code:

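Source = LOAD '$data' USING PigStorage(',', '-tagFile'); -- -tagFile prepends the source file name as $0
-- SPLIT is a statement, not an assignment; the quotes are still in the raw data, so match '"HEADER"'
SPLIT Source INTO FileHeaders IF $1 == '"HEADER"', FileData OTHERWISE;
B = GROUP FileData BY $0;        -- one bag of detail records per file
C = GROUP FileHeaders BY $0;     -- one bag holding the header row per file
D = JOIN B BY group, C BY group; -- the grouping key is named 'group' (lowercase)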
...
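
One possible way to finish the idea is sketched below. It is not necessarily what the answer intended: it swaps the two GROUPs plus JOIN for a single COGROUP on the file name, which produces the header and the bag of records side by side. It assumes exactly one header row per file (as the question describes), a Pig version that supports the '-tagFile' option and SPLIT ... OTHERWISE, and that the quotes have not yet been stripped. The relation names Grouped and Result are made up for illustration.

```pig
-- Sketch under the assumptions stated above; $data is the script parameter from the question.
Source = LOAD '$data' USING PigStorage(',', '-tagFile');   -- $0 = source file name

-- Header rows carry the literal "HEADER" (quotes included) in the first data field.
SPLIT Source INTO FileHeaders IF $1 == '"HEADER"', FileData OTHERWISE;

-- COGROUP on the file name: each row is (file name, {header tuples}, {record tuples}).
Grouped = COGROUP FileHeaders BY $0, FileData BY $0;

-- Flatten the one-element header bag so the header fields sit next to the bag of records.
Result = FOREACH Grouped GENERATE FLATTEN(FileHeaders), FileData;

DUMP Result;
```

Note that the file name added by '-tagFile' remains as the first field of the header and of every record tuple; it can be projected away in a later FOREACH if it is not wanted.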