0
votes

I have some data that looks something like this, where I have a list of events of different types, the time the event occurred, and the user the event is associated with. Event A only occurs once per user but event B can occur multiple times.

Raw Data

and I want to turn it into this:

Processed Data

The logic is as follows:

  1. I query the table to extract a list of the times event A occurred for each user
  2. For each row I then find the most recent time that event B occurred prior to event A

The first point is easy enough, something along the lines of

SELECT User, Time AS Event_A_Time FROM Raw_Data WHERE Event = 'Event A'

Where I get stuck is working out how to get the third column. It will need to be something like

SELECT MAX(Time) FROM Raw_Data WHERE Time < [Event_A_Time for this user]

I'm assuming I need to use a UDF for this, but I'm struggling with how to fit this into the syntax (using standard SQL). Any help appreciated!

1

1 Answers

1
votes

You shouldn't need a UDF for this kind of calculation. Try this:

SELECT * EXCEPT(Event_B_Times),
  (SELECT MAX(time) FROM UNNEST(Event_B_Times) AS Time
   WHERE Time < Event_A_Time) AS Event_B_Time
FROM (
  SELECT
    User,
    MAX(IF(Event = 'Event A', Time, NULL)) AS Event_A_Time,
    ARRAY_AGG(IF(Event = 'Event B', Time, NULL) IGNORE NULLS) AS Event_B_Times
  FROM Raw_Data
  GROUP BY User
);