I have a table of consisting of email, day (TIMESTAMP), id, and multiple other columns.
For each email entry, I would like to count the number of unique ids associated with that email in the preceding 3 days.
+-------------------+-------------------------+------------+----+
| email | day | other cols | id |
+-------------------+-------------------------+------------+----+
| [email protected] | 2020-06-21 16:31:00 UTC | ... | 0 |
| [email protected] | 2020-06-22 14:54:00 UTC | ... | 1 |
| [email protected] | 2020-06-23 08:23:00 UTC | ... | 0 |
| [email protected] | 2020-06-24 13:51:00 UTC | ... | 0 |
| [email protected] | 2020-06-25 09:54:00 UTC | ... | 2 |
| [email protected] | 2020-06-25 12:25:00 UTC | ... | 0 |
| [email protected] | 2020-06-26 15:21:00 UTC | ... | 2 |
| [email protected] | 2020-06-21 12:23:00 UTC | ... | 0 |
| [email protected] | 2020-06-21 16:54:00 UTC | ... | 0 |
| [email protected] | 2020-06-22 08:23:00 UTC | ... | 0 |
| [email protected] | 2020-06-22 12:13:00 UTC | ... | 1 |
| [email protected] | 2020-06-24 09:32:00 UTC | ... | 1 |
| [email protected] | 2020-06-25 05:45:00 UTC | ... | 1 |
| [email protected] | 2020-06-26 12:32:00 UTC | ... | 2 |
| [email protected] | 2020-06-27 19:53:00 UTC | ... | 1 |
+-------------------+-------------------------+------------+----+
The additional column should look like this:
+-------------------+-------------------------+------------+----+-----------------------------+
| email | day | other cols | id | distinct ids in last 3 days |
+-------------------+-------------------------+------------+----+-----------------------------+
| [email protected] | 2020-06-21 16:31:00 UTC | ... | 0 | 1 |
| [email protected] | 2020-06-22 14:54:00 UTC | ... | 1 | 2 |
| [email protected] | 2020-06-23 08:23:00 UTC | ... | 0 | 2 |
| [email protected] | 2020-06-24 13:51:00 UTC | ... | 0 | 2 |
| [email protected] | 2020-06-25 09:54:00 UTC | ... | 2 | 3 |<- 3, because ids 0, 1 and 2 have been seen in previous 3 days
| [email protected] | 2020-06-25 12:25:00 UTC | ... | 0 | 3 |
| [email protected] | 2020-06-26 15:21:00 UTC | ... | 2 | 2 |
| [email protected] | 2020-06-21 12:23:00 UTC | ... | 0 | 1 |
| [email protected] | 2020-06-21 16:54:00 UTC | ... | 0 | 1 |
| [email protected] | 2020-06-22 08:23:00 UTC | ... | 0 | 1 |
| [email protected] | 2020-06-22 12:13:00 UTC | ... | 1 | 2 |
| [email protected] | 2020-06-24 09:32:00 UTC | ... | 1 | 2 |
| [email protected] | 2020-06-25 05:45:00 UTC | ... | 1 | 2 |
| [email protected] | 2020-06-26 12:32:00 UTC | ... | 1 | 1 |
| [email protected] | 2020-06-27 19:53:00 UTC | ... | 1 | 1 |
+-------------------+-------------------------+------------+----+-----------------------------+
I have tried using a window function to partition by email and count distinct ids over the preceding 3 days.
COUNT(DISTINCT id) OVER (PARTITION BY email ORDER BY UNIX_DATE(PARSE_DATE('%Y-%m-%d', day))*24*3600 RANGE BETWEEN 3*24*3600 PRECEDING AND CURRENT ROW)
However this is not allowed:
Window ORDER BY is not allowed if DISTINCT is specified
There are solutions on stack overflow, such as this. However I'm not certain that it accounts for the need to partition by email before counting unique ids.
I'd be grateful for any pointers on this. If it's easier, I'd also be open to a solution that uses DATE rather than TIMESTAMP.