0
votes

I have loaded 3 tables for 3 subjects in PIG, each having Name:chararray and score:float. The same name may not occur necessarily in all subjects.

I need to add the scores from the 3 tables into a single one, having Name and Total score.

I used to do this in SQL using nested queries. How to do this in PIG? I tried using full outer join, but was stuck after encountering null values in the Name columns for the subjects in which the name doesn't exist.

2
Could you please post what you have tried? A full outer JOIN is a possible approach here, but you will have to massage the output a bit after the JOIN. - mr2ert

2 Answers

0
votes

From the description of your problem, a simple UNION of the 3 files, followed by a GROUP BY should produce the results you're looking for.


Example Code:

data_1 = LOAD 'union1.csv' USING PigStorage(',') AS (name:chararray,score:float);
data_2 = LOAD 'union2.csv' USING PigStorage(',') AS (name:chararray,score:float);
data_3 = LOAD 'union3.csv' USING PigStorage(',') AS (name:chararray,score:float);

data = UNION data_1,data_2,data_3;

data_grp = GROUP data BY name;

data_gen = FOREACH data_grp GENERATE group, SUM(data.score);

dump data_gen;




Datasets:

bob,3
elvis,4
jim,4
dave,2
sneech,4
suess,3
giri,5



union2.csv

mike,2
rick,3
jim,3
giri,4
dave,3
elvis,5



union3.csv

bob,5
bing,4
suess,4
sneech,5
dave,4
jim,2
giri,2




Results:

(bob,8.0)
(jim,9.0)
(bing,4.0)
(dave,9.0)
(giri,11.0)
(mike,2.0)
(rick,3.0)
(elvis,9.0)
(suess,7.0)
(sneech,9.0)
0
votes

Though you can do this using an outer JOIN, I think that UNIONing all of the tables together and then GROUPing by the Name field will be easier.

-- T1, T2, and T3 are the tables you have loaded.  Each has the schema 
-- TX: (Name: chararray, score: float)

F = UNION T1, T2, T3;

-- Even if the name appears in only one table, then the result of GROUP will only
-- have one item in the bag. This means we can use SUM regardless of how many tables
-- the name is in.
G = GROUP F BY Name;

H = FOREACH G GENERATE group AS Name, SUM(F.score) AS totalscore;