1 vote

In our project we have to periodically calculate aggregates and further derived values from the input data we receive.

One frequent requirement is to calculate time differences between certain rows in our input data stream.

For example, this is my input data stream:

Timestamp          Event  Value
2017-05-21 11:33   e1     17
2017-05-21 11:37   e2     18
2017-05-21 11:38   e3     18
2017-05-21 11:39   e1     19
2017-05-21 11:42   e2     19

I now want to calculate all the timespans between e2 events and the last received e1 event (ordered by timestamp).

I would expect the result to be: 3 minutes and 4 minutes.

A similar requirement would be to calculate the timespans between events of the same type (i.e. all differences between e1 events), where I would expect this result: 6 minutes.

My attempts so far:

This sort of analysis could be achieved fairly easily using the LAG function in conjunction with a WHEN clause, but unfortunately the WHEN clause is missing in U-SQL. In T-SQL it would also be possible to solve this with sub-selects in the SELECT clause of the statement, but unfortunately that is not possible in U-SQL either.

Do you have any suggestions or sample scripts on how to solve this issue? Thank you very much for your help!


2 Answers

1 vote

In U-SQL, you can use C# methods for simple date arithmetic. If your data is as simple as you describe, you could just rank the e1 and e2 events and then join them, something like this:

@data =
    EXTRACT Timestamp DateTime,
            Event string,
            Value int
    FROM "/input/input58.csv"
    USING Extractors.Csv();

// Alternative: inline test data instead of reading from a file (uncomment to use).
//@data = SELECT *
//     FROM (
//        VALUES
//        ( DateTime.Parse("2017-05-21 11:33"), "e1", 17 ),
//        ( DateTime.Parse("2017-05-21 11:37"), "e2", 18 ),
//        ( DateTime.Parse("2017-05-21 11:38"), "e3", 18 ),
//        ( DateTime.Parse("2017-05-21 11:39"), "e1", 19 ),
//        ( DateTime.Parse("2017-05-21 11:42"), "e2", 19 )
//     ) AS T(Timestamp, Event, Value);


// Number the e1 events in timestamp order.
@e1 =
    SELECT ROW_NUMBER() OVER(ORDER BY Timestamp) AS rn,
           *
    FROM @data
    WHERE Event == "e1";

// Number the e2 events in timestamp order.
@e2 =
    SELECT ROW_NUMBER() OVER(ORDER BY Timestamp) AS rn,
           *
    FROM @data
    WHERE Event == "e2";

// Pair the n-th e1 event with the n-th e2 event and compute the difference
// (TimeSpan also exposes TotalMinutes if minute granularity is preferred).
@working =
    SELECT
        (e2.Timestamp - e1.Timestamp).TotalSeconds AS diff_sec,
        (e2.Timestamp - e1.Timestamp).ToString() AS diff_hhmmss,
        e1.Timestamp AS ts1,
        e2.Timestamp AS ts2
    FROM @e1 AS e1
         INNER JOIN @e2 AS e2 ON e1.rn == e2.rn;


OUTPUT @working TO "/output/output.csv"
USING Outputters.Csv(quoting:false);

My results, showing 4 and 3 minutes for sample data:


Will that work for you? If not, please provide a more realistic data sample.

1 vote

Another option is to carry the timestamp of the most recent e1 event forward with a window function and keep only the e2 rows:

// @events is the input rowset of events (Timestamp, Event, Value) shown in the question.
@data =
    SELECT
        // Carry forward the timestamp of the latest e1 seen so far; other rows contribute NULL.
        LAST_VALUE(Event == "e1" ? Timestamp : (DateTime?)null) OVER (ORDER BY Timestamp) AS E1Time
        // MAX(Event == "e1" ? Timestamp : DateTime.MinValue) OVER (ORDER BY Timestamp) AS E1Time
        , Timestamp AS E2Time
    FROM @events
    HAVING Event == "e2"   // keep only the e2 rows, after the window has been computed
    ;

This works because aggregates and window functions ignore NULLs (at least they should; the U-SQL documentation for LAST_VALUE doesn't say, so it needs verification), which allows the emulation of conditional behavior such as WHEN. Similar behavior can be obtained with MAX/MIN and an appropriate default value.
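To get the actual time spans out of that rowset, a follow-up SELECT can subtract the two columns. The statement below is only a sketch (the output path is made up) and assumes E1Time is never null for an e2 row, i.e. that an e1 always precedes an e2:

@diffs =
    SELECT (E2Time - E1Time.Value).TotalMinutes AS DiffMinutes,  // gap in minutes
           E1Time,
           E2Time
    FROM @data;

OUTPUT @diffs TO "/output/diffs.csv" USING Outputters.Csv();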

That said, you should specify the input data and the expected result in detail, as that may alter the solution. In particular: can aberrant data sequences occur, and what behavior is expected (or at least tolerated for the sake of simplicity) if they do?

  1. e1, e1, e2 - Above code ignores earlier e1
  2. e1, e2, e2 - Above code computes 2 values wrt the same e1
  3. e1, e1, e2, e2 - Above code doesn't recognize nesting, same as case 2.
  4. e2 with no preceding e1 - Above code either produces a null E1Time (which may crash later arithmetic) or throws the results off when DateTime.MinValue is used as the default.

etc. Past a certain level of complexity you would probably have to fall back on a custom reducer via REDUCE ALL (this is a last resort!), but that would restrict the size of data that can be processed.
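For illustration, such a reducer might look roughly like the sketch below. The namespace, class name, output path, and the pairing rule (each e1 is consumed by at most one e2; an e2 without a preceding e1 is dropped) are assumptions made for this example, not part of the answer above; the code-behind sorts the rows itself rather than relying on a particular input order.

C# code-behind:

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Analytics.Interfaces;

namespace EventGaps
{
    [SqlUserDefinedReducer]
    public class E1E2GapReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Copy the column values out first; input rows are only valid while iterating.
            var events = new List<Tuple<DateTime, string>>();
            foreach (var row in input.Rows)
            {
                events.Add(Tuple.Create(row.Get<DateTime>("Timestamp"),
                                        row.Get<string>("Event")));
            }

            DateTime? lastE1 = null;
            foreach (var e in events.OrderBy(x => x.Item1))
            {
                if (e.Item2 == "e1")
                {
                    lastE1 = e.Item1;                 // remember the latest e1
                }
                else if (e.Item2 == "e2" && lastE1 != null)
                {
                    output.Set<double>("DiffMinutes", (e.Item1 - lastE1.Value).TotalMinutes);
                    lastE1 = null;                    // assumed rule: one e2 per e1
                    yield return output.AsReadOnly();
                }
            }
        }
    }
}

U-SQL script:

@gaps =
    REDUCE @events ALL
    PRODUCE DiffMinutes double
    USING new EventGaps.E1E2GapReducer();

OUTPUT @gaps TO "/output/gaps.csv" USING Outputters.Csv();

Because of REDUCE ALL, the whole rowset is funneled through a single vertex, which is what restricts the data size as mentioned above.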