Distinct count over a sliding window

Question

I have 3 set of columns [date, status, customerid]. I need an elegant way to calculate unique customers from Tuesday to day, meaning on a Tuesday the count will include only distinct customers for that Tuesday. On Wednesday the count includes distinct customers from Tuesday and Wednesday. On a Thursday, the count should include distinct customers from Tuesday, Wednesday and Thursday. This will repeat until the following Monday and on Tuesday the cycle restarts. How do I approach this logic using SQL? Also assuming count distinct over window function is not supported.

Assume this is the source

+----------+-----------+---------+                                              
|      date|customer_id|   status|
+----------+-----------+---------+
|2020-06-08|        001|AVAILABLE|
|2020-06-08|        001|  EXPIRED|
|2020-06-08|        001|AVAILABLE|
|2020-06-08|        002|AVAILABLE|
|2020-06-08|        002|  EXPIRED|
|2020-06-08|        003|  EXPIRED|
|2020-06-08|        003|AVAILABLE|
|2020-06-09|        001|AVAILABLE|
|2020-06-09|        001|AVAILABLE|
|2020-06-09|        002|  EXPIRED|
|2020-06-09|        003|AVAILABLE|
|2020-06-09|        003|  EXPIRED|
|2020-06-09|        003|  EXPIRED|
|2020-06-10|        001|  EXPIRED|
|2020-06-10|        001|  EXPIRED|
|2020-06-10|        001|AVAILABLE|
|2020-06-10|        001|AVAILABLE|
|2020-06-10|        002|AVAILABLE|
|2020-06-10|        002|AVAILABLE|
|2020-06-10|        002|  EXPIRED|
|2020-06-10|        002|AVAILABLE|
|2020-06-10|        002|  EXPIRED|
|2020-06-10|        003|  EXPIRED|
|2020-06-10|        003|AVAILABLE|
|2020-06-10|        003|AVAILABLE|
|2020-06-11|        001|  EXPIRED|
|2020-06-11|        001|  EXPIRED|
|2020-06-12|        001|AVAILABLE|
|2020-06-12|        001|  EXPIRED|
|2020-06-12|        003|  EXPIRED|
|2020-06-12|        003|AVAILABLE|
|2020-06-12|        004|AVAILABLE|
|2020-06-13|        001|AVAILABLE|
|2020-06-13|        002|AVAILABLE|
|2020-06-13|        002|AVAILABLE|
|2020-06-13|        002|AVAILABLE|
|2020-06-14|        001|  EXPIRED|
|2020-06-14|        003|  EXPIRED|
|2020-06-14|        004|  EXPIRED|
|2020-06-15|        001|  EXPIRED|
|2020-06-15|        001|AVAILABLE|
|2020-06-15|        001|  EXPIRED|
|2020-06-15|        003|  EXPIRED|
|2020-06-15|        003|AVAILABLE|
|2020-06-16|        001|AVAILABLE|
|2020-06-16|        001|  EXPIRED|
|2020-06-16|        002|AVAILABLE|
|2020-06-16|        002|AVAILABLE|
|2020-06-16|        002|  EXPIRED|
|2020-06-16|        002|  EXPIRED|
|2020-06-16|        003|  EXPIRED|
+----------+-----------+---------+

This is the expected outcome

+----------+-----------+---------+                                              
|      date|      count|   status|
+----------+-----------+---------+
|2020-06-08|       NULL|     NULL|
|2020-06-09|          2|AVAILABLE|
|2020-06-09|          2|  EXPIRED|
|2020-06-10|          3|  EXPIRED|
|2020-06-10|          3|AVAILABLE|
|2020-06-11|          3|AVAILABLE|
|2020-06-11|          3|  EXPIRED|
|2020-06-12|          4|AVAILABLE|
|2020-06-12|          3|  EXPIRED|
|2020-06-13|          4|AVAILABLE|
|2020-06-14|          4|  EXPIRED|
|2020-06-15|          4|  EXPIRED|
|2020-06-15|          4|AVAILABLE|
|2020-06-16|          3|  EXPIRED|
|2020-06-16|          2|AVAILABLE|
+----------+-----------+---------+

Gordon Linoff Gordon Linoff · Accepted Answer · 2020-06-20T11:12:06

This answers the original version of the question.

Your question has two parts.

One is identifying the right time frame. In PrestoDB weeks do not begin on Tuesday. I think they begin on Monday.

The other is the count(distinct). I think Presto supports this as a window function. So, you can do:

select distinct date,
       count(distinct user_id) over (partition by date_trunc('week', date - interval '1 day')
                                     order by date
                                    ) as unique_users_since_tuesday
from t;

Note that this is phrased as a window function, not an aggregation.

Distinct count over a sliding window

1 Answers