0
votes

I have the following relation in Apache Pig.

TSERIES: {ORDERED: {(timestamp: long,contentHost: chararray)},ts1: long}

And I want to do the following:

F = foreach TSERIES {
    ts = filter ORDERED by timestamp > TSERIES.ts1;
    generate ts;
}

In short, I want to keep all elements of the bag ORDERED with a timestamp higher than ts1, but Pig won't allow it, specifically this part: ts = filter ORDERED by timestamp > TSERIES.ts1;

Is this possible? I'm using version 0.9.2-cdh4.0.1 (cloudera).

2
Does ts1 happen to be unique for each tuple by any chance? - Joe K
I have no strong guarantees, but I'd say it's unique for 99% of tuples. Since it's a timestamp, there's no hard rule that says two timestamps cannot be exactly the same in this case (clickstream data). - Miguel Ping
I have the same issue with Pig 0.14. Did you find a way to make it work? - Romain Jouin
I think I used a UDF. - Miguel Ping

2 Answers

0
votes

Did you try:

Test = filter TSERIES by (ORDERED.timestamp > ts1);

0
votes

I'm not sure if there's a way to do this without a UDF. It seems like there should be, but I can't figure it out either. Anyway, you could either write a UDF to do this directly (go through the bag, filter out some elements, and return a bag), or you could write a UDF to generate UUIDs, then flatten the bag and re-group it, something like this:

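-- Tag each input tuple with a unique id (GenerateUUID is a UDF you would have to write).
a = foreach TSERIES generate ORDERED, ts1, myudfs.GenerateUUID() as id;
-- FLATTEN yields one row per bag element, with the element's fields at the top level.
b = foreach a generate FLATTEN(ORDERED) as (timestamp, contentHost), ts1, id;
c = filter b by timestamp > ts1;
-- Re-group by id to rebuild one bag per original tuple.
d = group c by id;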
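
For the first option (a UDF that filters the bag directly), here is a minimal sketch written as a Jython UDF, assuming your Pig build has Jython UDF support. The function name filter_after and the file name bagfilter.py are made up for illustration. I'm also assuming Pig supplies the outputSchema decorator when it loads the script; the fallback definition is only there so the function can be run outside Pig.

```python
# Sketch of a Jython UDF for Pig; the file and function names are hypothetical.
# When Pig loads the script via "register ... using jython", it is expected to
# supply the outputSchema decorator. The fallback below only exists so the
# function can be run and tested outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("filtered:{t:(timestamp:long,contentHost:chararray)}")
def filter_after(ordered, ts1):
    """Return the tuples of `ordered` whose timestamp is strictly greater than ts1."""
    if ordered is None or ts1 is None:
        return None
    # Each element is a (timestamp, contentHost) tuple; skip null timestamps.
    return [t for t in ordered if t[0] is not None and t[0] > ts1]


# Assumed usage from Pig Latin (file name is hypothetical):
#   register 'bagfilter.py' using jython as myudfs;
#   F = foreach TSERIES generate myudfs.filter_after(ORDERED, ts1) as ts;
```

Unlike the flatten/re-group approach, this keeps one output tuple per input tuple even when nothing in the bag passes the filter (you get an empty bag), and it does not depend on ts1 being unique.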