I have two data sources. One contains a list of API calls and the other contains all related authentication events. There can be multiple auth events for each API call; I want to find the auth event that:
a) contains the same "identifier" as the Api Call
b) happened within a second after the Api Call
c) is the closest to the Api Call after the above filtering.

I had planned to loop through each ApiCall event in a foreach loop and then use filter statements on the authevents to find the correct one - however, it does not appear that this is possible (USING Filter in a Nested FOREACH in PIG)

Would anyone be able to suggest other ways to achieve this? If it helps, here's the Pig script I tried to use:

apiRequests = LOAD '/Documents/ApiRequests.txt' AS (api_fileName:chararray, api_requestTime:long, api_timeFromLog:chararray, api_call:chararray, api_leadString:chararray, api_xmlPayload:chararray, api_sourceIp:chararray, api_username:chararray, api_identifier:chararray);
authEvents = LOAD '/Documents/AuthEvents.txt' AS (auth_fileName:chararray, auth_requestTime:long, auth_timeFromLog:chararray, auth_call:chararray, auth_leadString:chararray, auth_xmlPayload:chararray, auth_sourceIp:chararray, auth_username:chararray, auth_identifier:chararray);
specificApiCall = FILTER apiRequests BY api_call == 'CSGetUser';                 -- Get all events for this specific call
match = foreach specificApiCall {                                                -- Now try to get the closest matching auth event
        filtered1 = filter authEvents by auth_identifier == api_identifier;      -- Only use auth events that have the same identifier (this will return several)
        filtered2 = filter filtered1 by (auth_requestTime-api_requestTime)<1000; -- Further refine by using auth events within a second of the api call's time
        sorted = order filtered2 by auth_requestTime;                            -- Get the auth event that's closest to the api call
        limited = limit sorted 1;
        generate limited;
        };
dump match;

1 Answer

Nested FOREACH is not for working with a second relation while looping over the first one. It's for when your relation has a bag in it and you want to work with that bag as though it were its own relation. You cannot work with apiRequests and authEvents at the same time unless you do some kind of joining or grouping first to put all the information you need into a single relation.

Conceptually, a JOIN followed by a FILTER would solve your task if you did not need to limit yourself to a single authorization event:

allPairs = JOIN specificApiCall BY api_identifier, authEvents BY auth_identifier;
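-- Keep auth events at or after the API call and less than 1000 ms later
match = FILTER allPairs BY (auth_requestTime-api_requestTime)>=0 AND (auth_requestTime-api_requestTime)<1000;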

Now all the information is together, and you could do GROUP match BY api_identifier followed by a nested FOREACH to pick out a single event.
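The GROUP and nested FOREACH step just described might look roughly like the sketch below. The relation and alias names (afterCall, perCall, closest, by_time, first_auth) are made up for illustration. Grouping on (api_identifier, api_requestTime) rather than api_identifier alone is an assumption: it treats that pair as identifying one API call, which matters only if several calls share an identifier.

```pig
-- Start from `match` above; drop any auth events that precede the call.
afterCall = FILTER match BY auth_requestTime >= api_requestTime;

-- One group per API call (assumes identifier + request time identify a call).
perCall = GROUP afterCall BY (api_identifier, api_requestTime);

-- Within each group, keep the joined row with the earliest auth event,
-- i.e. the auth event closest to the call among those inside the window.
closest = FOREACH perCall {
    by_time = ORDER afterCall BY auth_requestTime;
    first_auth = LIMIT by_time 1;
    GENERATE FLATTEN(first_auth);
};
```

Because the time window is applied before the ORDER and LIMIT, the row that survives is the nearest auth event among the valid candidates, not merely the earliest one overall.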

However, you could do this in a single step if you use the COGROUP operator, which is like JOIN but without the cross-product -- you get two bags with the grouped records from each relation. Use this to pick out the nearest authorization event:

cogrp = COGROUP specificApiCall BY api_identifier, authEvents BY auth_identifier;
singleAuth = FOREACH cogrp {
    auth_sorted = ORDER authEvents BY auth_requestTime;
    auth_1 = LIMIT auth_sorted 1;
    GENERATE FLATTEN(specificApiCall), FLATTEN(auth_1);
    };

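Then FILTER to leave only the ones within 1 second after the call:

match = FILTER singleAuth BY (auth_requestTime-api_requestTime)>=0 AND (auth_requestTime-api_requestTime)<1000;

Note that auth_1 is the earliest auth event for the identifier regardless of when the API call happened, so this is only right if no auth event for that identifier precedes the call. If one might, filter on the time window before sorting, as in the JOIN approach.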