I've been trying to find a solution to the following Redshift riddle for a week now (I think I'm becoming obsessed with it):
There is a table of events in Redshift ("event_user_item") in which users trigger events for certain items by entering an item's code, which appears in the event_value column.
A failed submission consists of the event_type sequence PageLoad-ItemCode-ErrorResponse, but these events are not necessarily consecutive: for a given user_id, any number of other event types can occur between them.
Below is a small excerpt covering 3 different user_ids that should illustrate the relevant scenarios, with a focus on failed submissions.
ord_num  event_type        event_value   user_id  event_datetime
1        PageLoad                        124      03/09/2018 21:48:39
2        ItemCode          LG56731       124      03/09/2018 21:48:53
4        Details1PageLoad                124      03/09/2018 21:48:56
8        PageLoad                        124      03/09/2018 22:02:23
9        ItemCode          GU07019       124      03/09/2018 22:02:32
10       ErrorResponse     Some message  124      03/09/2018 22:02:32
51       PageLoad                        228      04/09/2018 12:38:30
52       ItemCode          EQ23487       228      04/09/2018 12:38:33
53       ErrorResponse     Some message  228      04/09/2018 12:38:34
54       PageLoad                        304      04/09/2018 15:43:14
55       ItemCode          OB68102       304      04/09/2018 15:43:57
56       ErrorResponse     Some message  304      04/09/2018 15:43:58
57       ItemCode          PB68102       304      04/09/2018 15:44:21
58       ErrorResponse     Some message  304      04/09/2018 15:44:22
59       PageLoad                        304      05/09/2018 11:19:37
60       ItemCode          OB68102       304      05/09/2018 11:20:17
62       Details1PageLoad                304      05/09/2018 11:20:20
THE OBJECTIVE: find the number of failed submissions per user_id per ItemCode. What matters is not to mix up item codes from failed submissions with those from successful ones. Note also that the same item code can appear in more than one failed submission.
I am no expert in Redshift, especially not in its window functions, but the first idea I tried was the LAG function. To use it, I intended to identify the sequences of ord_nums that would qualify to be counted, such as:
ord_num  event_type        event_value   user_id  event_datetime       error?  sequence
1        PageLoad                        124      03/09/2018 21:48:39
2        ItemCode          LG56731       124      03/09/2018 21:48:53
4        Details1PageLoad                124      03/09/2018 21:48:56
8        PageLoad                        124      03/09/2018 22:02:23
9        ItemCode          GU07019       124      03/09/2018 22:02:32
10       ErrorResponse     Some message  124      03/09/2018 22:02:32  1       8-9-10
51       PageLoad                        228      04/09/2018 12:38:30
52       ItemCode          EQ23487       228      04/09/2018 12:38:33
53       ErrorResponse     Some message  228      04/09/2018 12:38:34  1       51-52-53
54       PageLoad                        304      04/09/2018 15:43:14
55       ItemCode          OB68102       304      04/09/2018 15:43:57
56       ErrorResponse     Some message  304      04/09/2018 15:43:58  1       54-55-56
57       ItemCode          PB68102       304      04/09/2018 15:44:21
58       ErrorResponse     Some message  304      04/09/2018 15:44:22  1       54-57-58
59       PageLoad                        304      05/09/2018 11:19:37
60       ItemCode          OB68102       304      05/09/2018 11:20:17
62       Details1PageLoad                304      05/09/2018 11:20:20
So, aggregated by user_id, the counts should be:
user_id  nr_failed_submissions
124      1
228      1
304      2
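Broken down per ItemCode, as the objective asks, that would be (derived by hand from the excerpt above):

user_id  item_code  nr_failed_submissions
124      GU07019    1
228      EQ23487    1
304      OB68102    1
304      PB68102    1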
However, as the data set and the expected outcome above show, it is not predictable how many records one has to step back over, so I would need an additional condition that cannot be put inside a LAG...
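For illustration, my fixed-offset attempts looked roughly like this (a simplified sketch; table and column names as in the excerpt above):

    -- Naive fixed-offset attempt: assumes PageLoad, ItemCode and
    -- ErrorResponse always sit on three consecutive rows per user.
    SELECT user_id, item_code
    FROM (
        SELECT user_id,
               event_type,
               LAG(event_type, 1)  OVER (PARTITION BY user_id ORDER BY ord_num) AS prev_type,
               LAG(event_value, 1) OVER (PARTITION BY user_id ORDER BY ord_num) AS item_code,
               LAG(event_type, 2)  OVER (PARTITION BY user_id ORDER BY ord_num) AS prev2_type
        FROM event_user_item
    ) t
    WHERE event_type = 'ErrorResponse'
      AND prev_type  = 'ItemCode'
      AND prev2_type = 'PageLoad';

On the excerpt above this already misses the 54-57-58 failure (two rows back from ord_num 58 is another ErrorResponse, not a PageLoad), and any intervening event such as Details1PageLoad shifts the offsets as well.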
I've tried many options, but none of them fits.
Very useful and insightful posts have been:
- lag function to get the last different value(redshift)
- Last Non-Null Value in Redshift by Group
- How can I get the Redshift/Postgresql LAG window function to selectively exclude records?
- Assign a Sequence (session ID) to my table based on A value in field
but so far I haven't managed to fuse them into a solution that works. There must be a way to do this in Redshift?
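To make the question concrete, here is the kind of combination I have been experimenting with, building on the session-ID idea from the last linked post: a running SUM turns every PageLoad into the start of a new per-user session, and LEAD then pairs each ItemCode with the next ItemCode/ErrorResponse event inside the same session. Please treat it as a rough sketch; I am not at all sure it covers every scenario:

    -- Every PageLoad opens a new "session" for its user; within a session,
    -- an ItemCode whose next ItemCode/ErrorResponse event is an ErrorResponse
    -- is counted as one failed submission for that item code.
    WITH sessions AS (
        SELECT ord_num, event_type, event_value, user_id,
               SUM(CASE WHEN event_type = 'PageLoad' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY user_id ORDER BY ord_num
                         ROWS UNBOUNDED PRECEDING) AS session_id
        FROM event_user_item
    ),
    pairs AS (
        -- keep only the two event types that decide success vs. failure,
        -- so intervening events no longer shift any offsets
        SELECT user_id, session_id, event_type, event_value,
               LEAD(event_type) OVER (PARTITION BY user_id, session_id
                                      ORDER BY ord_num) AS next_type
        FROM sessions
        WHERE event_type IN ('ItemCode', 'ErrorResponse')
    )
    SELECT user_id,
           event_value AS item_code,
           COUNT(*)    AS nr_failed_submissions
    FROM pairs
    WHERE event_type = 'ItemCode'
      AND next_type  = 'ErrorResponse'
    GROUP BY user_id, event_value
    ORDER BY user_id, item_code;

Tracing it by hand against the excerpt, it seems to return the four per-item rows listed earlier (and therefore the expected per-user counts), but it rests on the assumption that every submission attempt really starts with its own PageLoad.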