I've been trying to find a solution to the following Redshift riddle for a week now (think I am becoming obsessed with it):

There is a table of events in Redshift ("event_user_item"), with users triggering events for certain items by entering an item's code, which appears in the event_value column.

A failed submission consists of the event_type sequence PageLoad-ItemCode-ErrorResponse, but these event types are not necessarily consecutive: there can be any number of other event types between each of them per user_id.

I am posting a small excerpt based on 3 different user_ids that should illustrate the relevant scenarios, focusing on failed submissions.

ord_num event_type          event_value     user_id     event_datetime
1       PageLoad                            124         03/09/2018 21:48:39
2       ItemCode            LG56731         124         03/09/2018 21:48:53
4       Details1PageLoad                    124         03/09/2018 21:48:56
8       PageLoad                            124         03/09/2018 22:02:23
9       ItemCode            GU07019         124         03/09/2018 22:02:32
10      ErrorResponse       Some message    124         03/09/2018 22:02:32
51      PageLoad                            228         04/09/2018 12:38:30
52      ItemCode            EQ23487         228         04/09/2018 12:38:33
53      ErrorResponse       Some message    228         04/09/2018 12:38:34
54      PageLoad                            304         04/09/2018 15:43:14
55      ItemCode            OB68102         304         04/09/2018 15:43:57
56      ErrorResponse       Some message    304         04/09/2018 15:43:58
57      ItemCode            PB68102         304         04/09/2018 15:44:21
58      ErrorResponse       Some message    304         04/09/2018 15:44:22
59      PageLoad                            304         05/09/2018 11:19:37
60      ItemCode            OB68102         304         05/09/2018 11:20:17
62      Details1PageLoad                    304         05/09/2018 11:20:20

THE OBJECTIVE: find the number of failed submissions per user_id per ItemCode. What is important is not to mix up item codes from failed submissions with those from successful submissions. Also, there might be multiple failure entries for the same item code.

I am not an expert in Redshift, especially its window functions, but the first idea I tried to stick to was the LAG function. To use it, I intended to identify sequences of ord_nums that would qualify to be counted, such as

ord_num event_type          event_value     user_id event_datetime           error?     sequence
1       PageLoad                            124     03/09/2018 21:48:39     
2       ItemCode            LG56731         124     03/09/2018 21:48:53     
4       Details1PageLoad                    124     03/09/2018 21:48:56     
8       PageLoad                            124     03/09/2018 22:02:23     
9       ItemCode            GU07019         124     03/09/2018 22:02:32     
10      ErrorResponse       Some message    124     03/09/2018 22:02:32     1       8-9-10
51      PageLoad                            228     04/09/2018 12:38:30     
52      ItemCode            EQ23487         228     04/09/2018 12:38:33     
53      ErrorResponse       Some message    228     04/09/2018 12:38:34     1       51-52-53
54      PageLoad                            304     04/09/2018 15:43:14     
55      ItemCode            OB68102         304     04/09/2018 15:43:57     
56      ErrorResponse       Some message    304     04/09/2018 15:43:58     1       54-55-56
57      ItemCode            PB68102         304     04/09/2018 15:44:21     
58      ErrorResponse       Some message    304     04/09/2018 15:44:22     1       54-57-58
59      PageLoad                            304     05/09/2018 11:19:37     
60      ItemCode            OB68102         304     05/09/2018 11:20:17     
62      Details1PageLoad                    304     05/09/2018 11:20:20     

So by user_id there should be the following counts:

user_id     nr_failed_submissions   
124         1   
228         1   
304         2

However, as the data set above and the expected outcome show, it is not predictable how many records to move backwards; I would need an additional condition that can't be put inside a LAG.

I've tried many options, but none of them fits.
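
One of the options I considered: instead of LAG, turn each PageLoad into a session boundary with a running conditional SUM, then apply LEAD only to the ItemCode/ErrorResponse rows so that intervening event types no longer matter. Below is just a sketch of that idea, checked against the sample rows using SQLite's window functions as a stand-in for Redshift's (an assumption; the query itself uses only SUM() OVER and LEAD(), which Redshift also supports):

```python
import sqlite3  # needs SQLite >= 3.25 for window function support

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event_user_item (
    ord_num INTEGER, event_type TEXT, event_value TEXT,
    user_id INTEGER, event_datetime TEXT);
INSERT INTO event_user_item VALUES
 (1,'PageLoad',NULL,124,'2018-09-03 21:48:39'),
 (2,'ItemCode','LG56731',124,'2018-09-03 21:48:53'),
 (4,'Details1PageLoad',NULL,124,'2018-09-03 21:48:56'),
 (8,'PageLoad',NULL,124,'2018-09-03 22:02:23'),
 (9,'ItemCode','GU07019',124,'2018-09-03 22:02:32'),
 (10,'ErrorResponse','Some message',124,'2018-09-03 22:02:32'),
 (51,'PageLoad',NULL,228,'2018-09-04 12:38:30'),
 (52,'ItemCode','EQ23487',228,'2018-09-04 12:38:33'),
 (53,'ErrorResponse','Some message',228,'2018-09-04 12:38:34'),
 (54,'PageLoad',NULL,304,'2018-09-04 15:43:14'),
 (55,'ItemCode','OB68102',304,'2018-09-04 15:43:57'),
 (56,'ErrorResponse','Some message',304,'2018-09-04 15:43:58'),
 (57,'ItemCode','PB68102',304,'2018-09-04 15:44:21'),
 (58,'ErrorResponse','Some message',304,'2018-09-04 15:44:22'),
 (59,'PageLoad',NULL,304,'2018-09-05 11:19:37'),
 (60,'ItemCode','OB68102',304,'2018-09-05 11:20:17'),
 (62,'Details1PageLoad',NULL,304,'2018-09-05 11:20:20');
""")

failed = con.execute("""
WITH sessions AS (
  -- running count of PageLoads per user = id of the "current" page load
  SELECT user_id, ord_num, event_type, event_value,
         SUM(CASE WHEN event_type = 'PageLoad' THEN 1 ELSE 0 END)
           OVER (PARTITION BY user_id ORDER BY ord_num
                 ROWS UNBOUNDED PRECEDING) AS session_id
  FROM event_user_item
),
steps AS (
  -- within each session, look only at ItemCode/ErrorResponse rows;
  -- other event types in between no longer matter
  SELECT user_id, event_value, event_type,
         LEAD(event_type) OVER (PARTITION BY user_id, session_id
                                ORDER BY ord_num) AS next_type
  FROM sessions
  WHERE event_type IN ('ItemCode', 'ErrorResponse')
)
SELECT user_id, event_value, COUNT(*) AS nr_failed
FROM steps
WHERE event_type = 'ItemCode' AND next_type = 'ErrorResponse'
GROUP BY user_id, event_value
ORDER BY user_id, event_value
""").fetchall()

print(failed)
```

On the sample data this yields one failed submission each for users 124 and 228 and two for user 304, which matches the expected counts; I have not verified the identical SQL on an actual Redshift cluster.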

Very useful and insightful posts have been

but so far I haven't managed to fuse them all into a solution that works. There must be a way to do this in Redshift?


2 Answers


This query will create "time ranges", where time1 represents the timestamp of a PageLoad event and time2 represents the timestamp of the next PageLoad event for that user:

WITH timeranges AS
(
  SELECT A.user_id,
         A.event_datetime AS time1,
         nvl(MAX(B.event_datetime),'2099-01-01') AS time2
  FROM foo AS A
    LEFT JOIN foo AS B
           ON A.user_id = B.user_id
          AND A.event_datetime < B.event_datetime
          AND A.event_type = B.event_type
  WHERE A.event_type = 'PageLoad'
  GROUP BY A.user_id,
           A.event_datetime
)

This query builds on that to associate each 'ItemCode' event with the timestamp of its corresponding 'PageLoad':

SELECT timeranges.time1 AS pageloadtime,
       foo.*
FROM foo
  LEFT JOIN timeranges
         ON foo.event_datetime >= timeranges.time1
        AND foo.event_datetime < timeranges.time2
WHERE foo.event_type = 'ItemCode'

This query determines whether any 'ErrorResponse' events have fallen in each of those ranges:

SELECT timeranges.time1 AS pageloadtime,
       timeranges.user_id,
       BOOL_OR(foo.event_type = 'ErrorResponse') AS has_error
FROM timeranges
  LEFT JOIN foo
         ON event_datetime > time1
        AND event_datetime < time2
GROUP BY timeranges.time1,
         timeranges.user_id
HAVING has_error;

And that should give us all the pieces we need: for each PageLoad event, we know (1) whether that pageload has an error, and (2) all of the ItemCode events that are associated with that pageload. Joining those two result sets should give us what we're looking for.

A peculiarity in Redshift gave me a bit of trouble when trying to join those two datasets directly, so I had to create two temporary tables. The following queries gave me the expected results:

create temporary table items_per_pageload as
with timeranges as (
  select A.user_id,
         A.event_datetime as time1,
         nvl(max(B.event_datetime), '2099-01-01') as time2
  from event_user_item as A
  left join event_user_item as B
         on A.user_id = B.user_id
        and A.event_datetime < B.event_datetime
        and A.event_type = B.event_type
  where A.event_type = 'PageLoad'
  group by A.user_id, A.event_datetime
)
select timeranges.time1 as pageloadtime, event_user_item.*
from event_user_item
left join timeranges
       on event_user_item.event_datetime >= timeranges.time1
      and event_user_item.event_datetime < timeranges.time2
where event_user_item.event_type = 'ItemCode';

create temporary table pageloads_with_errors as
with timeranges as (
  select A.user_id,
         A.event_datetime as time1,
         nvl(max(B.event_datetime), '2099-01-01') as time2
  from event_user_item as A
  left join event_user_item as B
         on A.user_id = B.user_id
        and A.event_datetime < B.event_datetime
        and A.event_type = B.event_type
  where A.event_type = 'PageLoad'
  group by A.user_id, A.event_datetime
)
select timeranges.time1 as pageloadtime,
       timeranges.user_id,
       bool_or(event_user_item.event_type = 'ErrorResponse') as has_error
from timeranges
left join event_user_item
       on event_datetime > time1
      and event_datetime < time2
group by timeranges.time1, timeranges.user_id
having has_error;

select count(1), user_id, event_value
from (
  select items_per_pageload.*
  from items_per_pageload
  join pageloads_with_errors
    on items_per_pageload.user_id = pageloads_with_errors.user_id
   and items_per_pageload.pageloadtime = pageloads_with_errors.pageloadtime
) as failed_items
group by user_id, event_value;

The following approach and queries, based on Jason Rosendale's answer, work for me as intended:

create temporary table items_per_pageload as 
with timeranges as (
  select A.user_id
    ,A.event_datetime as time1
    ,nvl(max(B.event_datetime), '2099-01-01') as time2
    ,LEAD(A.event_datetime,1) over (partition by A.user_id order by A.event_datetime) as next_load_time 
  from event_user_item as A 
  left join event_user_item as B on A.user_id=B.user_id and A.event_datetime < B.event_datetime and A.event_type=B.event_type
  where A.event_type='PageLoad' 
  group by A.user_id, A.event_datetime
  )
select timeranges.time1 as pageloadtime, event_user_item.* 
from event_user_item left join timeranges on event_user_item.event_datetime>=timeranges.time1 and event_user_item.event_datetime<nvl(timeranges.next_load_time,timeranges.time2) 
where event_user_item.event_type='ItemCode';

create temporary table pageloads_with_errors as 
with timeranges as (
  select A.user_id
    ,A.event_datetime as time1
    ,nvl(max(B.event_datetime), '2099-01-01') as time2
    ,LEAD(A.event_datetime,1) over (partition by A.user_id order by A.event_datetime) as next_load_time 
  from event_user_item as A left join event_user_item as B on A.user_id=B.user_id and A.event_datetime < B.event_datetime and A.event_type=B.event_type
  where A.event_type='PageLoad' 
  group by A.user_id, A.event_datetime
  )
select timeranges.time1 as pageloadtime, timeranges.user_id, bool_or(event_user_item.event_type='ErrorResponse') as has_error
from timeranges
left join event_user_item on event_datetime > time1 and event_datetime < nvl(next_load_time, time2)
group by timeranges.time1, timeranges.user_id
having has_error;

/* final counts */
select count(1), user_id, event_value from (
    select items_per_pageload.*
    from items_per_pageload
    join pageloads_with_errors on items_per_pageload.user_id = pageloads_with_errors.user_id and items_per_pageload.pageloadtime = pageloads_with_errors.pageloadtime
) as failed_items
group by user_id, event_value;
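
As a sanity check, these statements can be ported to SQLite and run against the sample rows from the question. The port makes several substitutions that are assumptions about equivalent behavior, not Redshift-verified SQL: COALESCE replaces NVL, MAX(...) = 1 replaces BOOL_OR, the self-join/MAX fallback collapses into COALESCE(LEAD(...), far-future), and a user_id equality is added to the range joins:

```python
import sqlite3  # needs SQLite >= 3.25 for window function support

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event_user_item (
    ord_num INTEGER, event_type TEXT, event_value TEXT,
    user_id INTEGER, event_datetime TEXT);
INSERT INTO event_user_item VALUES
 (1,'PageLoad',NULL,124,'2018-09-03 21:48:39'),
 (2,'ItemCode','LG56731',124,'2018-09-03 21:48:53'),
 (4,'Details1PageLoad',NULL,124,'2018-09-03 21:48:56'),
 (8,'PageLoad',NULL,124,'2018-09-03 22:02:23'),
 (9,'ItemCode','GU07019',124,'2018-09-03 22:02:32'),
 (10,'ErrorResponse','Some message',124,'2018-09-03 22:02:32'),
 (51,'PageLoad',NULL,228,'2018-09-04 12:38:30'),
 (52,'ItemCode','EQ23487',228,'2018-09-04 12:38:33'),
 (53,'ErrorResponse','Some message',228,'2018-09-04 12:38:34'),
 (54,'PageLoad',NULL,304,'2018-09-04 15:43:14'),
 (55,'ItemCode','OB68102',304,'2018-09-04 15:43:57'),
 (56,'ErrorResponse','Some message',304,'2018-09-04 15:43:58'),
 (57,'ItemCode','PB68102',304,'2018-09-04 15:44:21'),
 (58,'ErrorResponse','Some message',304,'2018-09-04 15:44:22'),
 (59,'PageLoad',NULL,304,'2018-09-05 11:19:37'),
 (60,'ItemCode','OB68102',304,'2018-09-05 11:20:17'),
 (62,'Details1PageLoad',NULL,304,'2018-09-05 11:20:20');

-- each PageLoad opens a range that closes at the user's next PageLoad
CREATE TEMPORARY TABLE items_per_pageload AS
WITH timeranges AS (
  SELECT user_id, event_datetime AS time1,
         COALESCE(LEAD(event_datetime) OVER (PARTITION BY user_id
                                             ORDER BY event_datetime),
                  '2099-01-01') AS time2
  FROM event_user_item
  WHERE event_type = 'PageLoad'
)
SELECT t.time1 AS pageloadtime, e.*
FROM event_user_item e
JOIN timeranges t
  ON e.user_id = t.user_id
 AND e.event_datetime >= t.time1
 AND e.event_datetime <  t.time2
WHERE e.event_type = 'ItemCode';

-- keep only the page loads whose range contains an ErrorResponse
CREATE TEMPORARY TABLE pageloads_with_errors AS
WITH timeranges AS (
  SELECT user_id, event_datetime AS time1,
         COALESCE(LEAD(event_datetime) OVER (PARTITION BY user_id
                                             ORDER BY event_datetime),
                  '2099-01-01') AS time2
  FROM event_user_item
  WHERE event_type = 'PageLoad'
)
SELECT t.time1 AS pageloadtime, t.user_id
FROM timeranges t
JOIN event_user_item e
  ON e.user_id = t.user_id
 AND e.event_datetime > t.time1
 AND e.event_datetime < t.time2
GROUP BY t.time1, t.user_id
HAVING MAX(e.event_type = 'ErrorResponse') = 1;
""")

counts = con.execute("""
SELECT user_id, COUNT(1) AS nr_failed_submissions
FROM (
  SELECT i.*
  FROM items_per_pageload i
  JOIN pageloads_with_errors p
    ON i.user_id = p.user_id
   AND i.pageloadtime = p.pageloadtime
) AS failed_items
GROUP BY user_id
ORDER BY user_id
""").fetchall()
print(counts)
```

Against the sample data this reproduces the expected per-user counts of 1, 1, and 2 failed submissions for users 124, 228, and 304.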