I want to achieve this in Pig but am not sure of an efficient way. I have an input file (with header: COL1,COL2,COL3,COL4,TAG) and multiple "value" files, all in a similar format (TAG,VALUE). I want to append the VALUE column of each "value" file to the input file, using TAG as the key column. So if there are 3 "value" files, the format of the final combined file will be (COL1,COL2,COL3,COL4,TAG,VALUE1,VALUE2,VALUE3).

One approach I can think of is to read each "value" file and join it with the input file incrementally, which means multiple intermediate files. First, join the input file with one "value" file; the output will be: COL1,COL2,COL3,COL4,TAG,VALUE1.

This output then becomes the new input file and is joined with another "value" file, giving COL1,COL2,COL3,COL4,TAG,VALUE1,VALUE2.
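
A rough, untested sketch of this incremental approach for the first two "value" files. It assumes comma-delimited files; the paths ('input_path', 'tv_1', 'tv_2') and relation names (inp, j1, s1, j2, s2) are placeholders made up for illustration. Each JOIN would typically run as its own MapReduce job.

```pig
-- Hypothetical paths and relation names; untested sketch.
inp  = LOAD 'input_path' USING PigStorage(',') AS (COL1, COL2, COL3, COL4, TAG);
tv_1 = LOAD 'tv_1' USING PigStorage(',') AS (TAG, VALUE);
tv_2 = LOAD 'tv_2' USING PigStorage(',') AS (TAG, VALUE);

-- First pass: attach VALUE1, keeping input rows that have no matching TAG.
j1 = JOIN inp BY TAG LEFT OUTER, tv_1 BY TAG;
s1 = FOREACH j1 GENERATE inp::COL1 AS COL1, inp::COL2 AS COL2,
                         inp::COL3 AS COL3, inp::COL4 AS COL4,
                         inp::TAG AS TAG, tv_1::VALUE AS VALUE1;

-- Second pass: the previous output becomes the new input.
j2 = JOIN s1 BY TAG LEFT OUTER, tv_2 BY TAG;
s2 = FOREACH j2 GENERATE s1::COL1 AS COL1, s1::COL2 AS COL2,
                         s1::COL3 AS COL3, s1::COL4 AS COL4,
                         s1::TAG AS TAG, s1::VALUE1 AS VALUE1,
                         tv_2::VALUE AS VALUE2;
```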

Is there a better way?


1 Answer

You could use COGROUP with multiple relations; it should need only one MapReduce job. The following code was typed without testing, but the idea should work:

header = LOAD 'header_path' using PigStorage(',') AS (COL1,COL2,COL3,COL4,TAG);
tv_1 = LOAD 'tv_1' using PigStorage(',') AS (TAG,VALUE);
tv_2 = LOAD 'tv_2' using PigStorage(',') AS (TAG,VALUE);
tv_3 = LOAD 'tv_3' using PigStorage(',') AS (TAG,VALUE);

joined = COGROUP header BY TAG, tv_1 BY TAG, tv_2 BY TAG, tv_3 BY TAG;

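-- FLATTEN(header) drops groups with no input row (left-join behaviour).
-- An empty value bag is replaced by a one-tuple bag holding null, so the row survives the FLATTEN.
result = FOREACH joined GENERATE
    FLATTEN(header),
    FLATTEN((IsEmpty(tv_1) ? TOBAG(TOTUPLE(null)) : tv_1.VALUE)) AS VALUE1,
    FLATTEN((IsEmpty(tv_2) ? TOBAG(TOTUPLE(null)) : tv_2.VALUE)) AS VALUE2,
    FLATTEN((IsEmpty(tv_3) ? TOBAG(TOTUPLE(null)) : tv_3.VALUE)) AS VALUE3;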