I ended up solving this problem by downloading all of the data and iterating through it in Python, but I wonder if there's a way to do it in BigQuery.
We have a table with start and end dates:
begin_date, end_date
'2016-02-19', '2016-02-19'
'2016-02-20', '2016-02-25'
'2016-02-21', '2016-02-25'
'2016-02-22', NULL
We want the count of rows for each date where begin_date <= date <= end_date. For any specific value, it's easy enough to select the count:
SELECT COUNT(*) FROM `table` WHERE begin_date <= '2016-12-19' AND (end_date >= '2016-12-19' OR end_date IS NULL)
So if I manually do this for each value I'm interested in, the desired output could look like this:
begin_date, count
2016-02-19, 1
2016-02-20, 1
2016-02-21, 2
2016-02-22, 3
2016-02-23, 3
2016-02-24, 3
2016-02-25, 3
2016-02-26, 1
etc.
It's easy enough to create the list of dates to iterate across:
WITH dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example)
Now I am struggling with applying the above WHERE clause across all these dates. I see how a partition with a range works when matching against a single column (like here), but I need to match against both begin_date and end_date.
I thought I could do it with this:
SELECT
status_begin_date,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE (e >= status_begin_date OR e IS NULL)) AS cnt
FROM (
SELECT
status_begin_date,
ARRAY_AGG(status_end_date) OVER(ORDER BY status_begin_date) AS ends
FROM `table`
)
ORDER BY status_begin_date
Taken from here. This works on the small example given in the StackOverflow answer, but I get a resource error using it on my table with several hundred million rows:
Is there a scalable solution in BigQuery?
sql WITH data AS ( SELECT DATE('2016-02-19') AS begin_date, DATE('2016-02-19') AS end_date UNION ALL SELECT '2016-02-20', '2016-02-25' UNION ALL SELECT '2016-02-21', '2016-02-25' UNION ALL SELECT '2016-02-22', NULL ), dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2016-02-19', '2016-02-26', INTERVAL 1 DAY)) AS example) SELECT begin_date, COUNT(*) FROM data CROSS JOIN dates WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL) GROUP BY begin_date ORDER BY begin_date- Xander Dunn