
MY SITUATION:

I have written a piece of code that returns a dataset containing a web user's aggregated activity for the previous 90 days and, after some calculation, returns a score; essentially an RFV (recency, frequency, value) model.

A (VERY) simplified version of the code can be seen below:

WITH start_data AS (
        SELECT user_id
        ,COUNT(web_visits) AS count_web_visits
        ,COUNT(button_clicks) AS count_button_clicks
        ,COUNT(login) AS count_log_in 
        ,SUM(time_on_site) AS total_time_on_site
        ,CURRENT_DATE AS run_date
        FROM web.table 
        WHERE  TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
        AND some_flag = 1
        AND some_other_flag = 2
        GROUP BY user_id
        ORDER BY user_id DESC 
)

The output might look something like the below:

| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256              | 932                 |16            |  1200              | 23-01-20 |
| 2391823 | 710              | 1345                |308           |  6000              | 23-01-20 |
| 3729128 | 67               | 204                 |83            |  320               | 23-01-20 |
| 5561296 | 437              | 339                 |172           |  3600              | 23-01-20 |

This output is then stored in its own AWS Redshift table and will form the base table for the task.

SELECT *
INTO myschema.base_table
FROM start_data;

DESIRED OUTPUT:

What I need to be able to do is run this code iteratively, appending new data to myschema.base_table every day, with each run aggregating the previous 90 days.

The way I see it, I can either go forwards or backwards, it doesn't matter.

That is to say, I can either:

  1. Starting from today, run the code, every day, for the preceding 90 days, going BACK to the (first date in the table + 90 days)

OR

  2. Starting from the (first date in the table + 90 days), run the code for the preceding 90 days, every day, going FORWARD to today.

Option 2 seems the best option to me and the desired output looks like this (PARTITION FOR ILLUSTRATION ONLY):
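For option 2, the starting point can be computed directly rather than hard-coded; a sketch, assuming web.table holds the full visit history:

```sql
-- first run_date that has a full 90-day lookback behind it
-- (assumption: web.table contains the complete history)
SELECT DATEADD(DAY, 90, MIN(visit_date)) AS first_run_date
FROM web.table;
```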

        | user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
        |---------|------------------|---------------------|--------------|--------------------|----------|
        | 1234567 | 412              | 339                 |180           |  3600              | 20-01-20 |
        | 2391823 | 417              | 6253                |863           |  2400              | 20-01-20 |
        | 3729128 | 67               | 204                 |83            |  320               | 20-01-20 |
        | 5561296 | 281              | 679                 |262           |  4200              | 20-01-20 | 
        |---------|------------------|---------------------|--------------|--------------------|----------|
        | 1234567 | 331              | 204                 |83            |  3200              | 21-01-20 |
        | 2391823 | 652              | 1222                |409           |  7200              | 21-01-20 |
        | 3729128 | 71               | 248                 |71            |  720               | 21-01-20 |
        | 5561296 | 366              | 722                 |519           |  3600              | 21-01-20 |
        |---------|------------------|---------------------|--------------|--------------------|----------|
        | 1234567 | 213              | 808                 |57            |  3600              | 22-01-20 |
        | 2391823 | 817              | 4265                |476           |  1200              | 22-01-20 |
        | 3729128 | 33               | 128                 |62            |  120               | 22-01-20 |
        | 5561296 | 623              | 411                 |283           |  2400              | 22-01-20 |
        |---------|------------------|---------------------|--------------|--------------------|----------|
        | 1234567 | 256              | 932                 |16            |  1200              | 23-01-20 |
        | 2391823 | 710              | 1345                |308           |  6000              | 23-01-20 |
        | 3729128 | 67               | 204                 |83            |  320               | 23-01-20 |
        | 5561296 | 437              | 339                 |172           |  3600              | 23-01-20 |

WHAT I HAVE TRIED: I have successfully created a WHILE loop to sequentially increment the date as follows:

CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE 
    start_date DATE := '2020-11-20' ;
BEGIN
  WHILE CURRENT_DATE > start_date
  LOOP 
    RAISE INFO 'Date: %', start_date;
    start_date := start_date + 1;
  END LOOP;
  RAISE INFO 'Loop Statement Executed Successfully';
END;
$$;

CALL retrospective_data();

Thus producing the dates as follows:

INFO:  Date: 2020-11-20
INFO:  Date: 2020-11-21
INFO:  Date: 2020-11-22
INFO:  Date: 2020-11-23
INFO:  Date: 2020-11-24
INFO:  Date: 2020-11-25
INFO:  Date: 2020-11-26
INFO:  Loop Statement Executed Successfully
Query 1 OK: CALL

WHAT I NEED HELP WITH:

I need to be able to apply the WHILE loop to the initial code such that the WHERE clause becomes:

   WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, start_date) AND start_date 

But where start_date is the result of each incremental loop. Additionally, the result of each execution needs to be appended to the previous.
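For reference, here is a sketch of how the loop body could drive the query and append each day's result. This assumes a Redshift stored procedure, where plpgsql variables such as start_date can be referenced directly inside the SQL, and that myschema.base_table already exists:

```sql
CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE
    start_date DATE := '2020-11-20';
BEGIN
  WHILE CURRENT_DATE > start_date
  LOOP
    -- append one day's 90-day aggregation; start_date is substituted into the SQL
    INSERT INTO myschema.base_table
    SELECT user_id
          ,COUNT(web_visits) AS count_web_visits
          ,COUNT(button_clicks) AS count_button_clicks
          ,COUNT(login) AS count_log_in
          ,SUM(time_on_site) AS total_time_on_site
          ,start_date AS run_date
    FROM web.table
    WHERE visit_date BETWEEN DATEADD(DAY, -90, start_date) AND start_date
      AND some_flag = 1
      AND some_other_flag = 2
    GROUP BY user_id;
    start_date := start_date + 1;
  END LOOP;
END;
$$;
```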

Any help appreciated.


1 Answer


It is fairly clear that you come from a procedural programming background, so my first recommendation is to stop thinking in terms of loops. Databases are giant, powerful data-filtering machines, and thinking in terms of 'do step 1, then step 2' often means missing out on all that power.

You want to look into window functions, which let you look across ranges of other rows for each row you are evaluating. This is exactly what you are trying to do.

Also, you shouldn't cast a date to a string just to compare it to other dates (your WHERE clause). This adds needless casting and defeats Redshift's table-scan optimizations. Redshift uses block metadata to limit what needs to be read from disk, but this cannot work when the column is being cast to another data type.
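In other words, compare the date column to dates directly so the block metadata can still be used. A sketch of the predicate, rewritten against your original query:

```sql
SELECT user_id, COUNT(web_visits) AS count_web_visits
FROM web.table
-- no function applied to visit_date, so Redshift can still prune blocks on it
WHERE visit_date BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
GROUP BY user_id;
```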

Now to your code (an off-the-cuff rewrite, and for just the first column). Be aware that GROUP BY runs BEFORE window functions, and I'm assuming that not all users have a visit every day. Since Redshift doesn't support RANGE in window frames, we need to make sure every date is represented for every user_id. This is done by UNIONing in enough rows to cover the date range. You may already have a date table, or may want to create one, but I'll just generate something on the fly to show the process (this assumes there are fewer dates in the range than rows in the table; likely, but not iron-clad).

SELECT user_id
,COUNT(web_visits) AS count_web_visits_by_day
,SUM(COUNT(web_visits)) OVER (PARTITION BY user_id ORDER BY visit_date
                              ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS count_web_visits_90_days
...
,visit_date
FROM (
    SELECT visit_date, user_id, web_visits, ...
    FROM web.table 
    WHERE some_flag = 1 AND some_other_flag = 2
    UNION ALL  -- this is where I union with a full set of dates for each user_id
    SELECT d.visit_date, u.user_id, NULL AS web_visits, ...
    FROM (SELECT DISTINCT user_id FROM web.table) u
    CROSS JOIN (
        SELECT CURRENT_DATE + 1 - ROW_NUMBER() OVER (ORDER BY visit_date) AS visit_date
        FROM web.table
    ) d
)
GROUP BY visit_date, user_id
ORDER BY visit_date ASC, user_id DESC;

The idea here is to set up your data so that you have at least one row for each user_id on each date. The window functions can then operate on the information grouped by date and user_id to sum and count over the past 90 rows (which, with one row per user per day, is the same as the past 90 days). You now have all the information you want for all dates, with each row looking back over 90 days. One query gives you everything: no WHILE loop, no stored procedures.

Untested, but it should give you the pattern. You may want to massage the output to the range you are looking for and clean up NULL result rows.
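For example, that cleanup could look like this (hypothetical; assumes the query above has been saved as a view called ninety_day_rollup):

```sql
SELECT *
FROM ninety_day_rollup
-- keep only dates that have a full 90 days of history behind them
WHERE visit_date >= (SELECT DATEADD(DAY, 90, MIN(visit_date)) FROM web.table)
-- drop the filler rows that only existed to densify the dates
  AND count_web_visits_by_day > 0;
```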