
I have a question. My data has the schema id:int, name:chararray:

1, abc
1, def
2, ghi
2, mno
2, pqr

After that I GROUP BY id, and my data becomes:

1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}

Now I want to pick a random value from each bag, with output like:

1, abc
2, mno

That would be the result if the first tuple was picked for group 1 and the second tuple for group 2.

Any idea how this can be done?

Specifically, I have grouped data B:

DESCRIBE B;
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}

C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}

rand =
    FOREACH B {
        shuf_ = FOREACH C GENERATE RANDOM() AS r, *;  -- line L
        shuf = ORDER shuf_ BY r;
        pick1 = LIMIT shuf 1;
    GENERATE
        group,
        FLATTEN(pick1);
    };

I get this error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"

1 Answer

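Use a nested FOREACH. Assign each item in the bag a random value, order by that value, and keep the first one. The inner FOREACH has to iterate over the bag field of the grouped relation (data in the script here, which would be A in your schema), not over a separate relation such as C. You can make it more compact than this, but this version shows each step.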

Script:

data = LOAD 'tmp/data.txt' AS (f1:int, f2:chararray);
grpd = GROUP data BY f1;
rand =
    FOREACH grpd {
        shuf_ = FOREACH data GENERATE f2, RANDOM() AS r;
        shuf = ORDER shuf_ BY r;
        pick1 = LIMIT shuf 1;
    GENERATE
        group,
        FLATTEN(pick1.f2);
    };
DUMP rand;

Output:

(1,abc)
(2,ghi)

Running it again:

(1,abc)
(2,pqr)

And again:

(1,def)
(2,pqr)

One more time!

(1,abc)
(2,ghi)

Whee!

(1,def)
(2,mno)