0
votes

I ended up solving this problem by downloading all of the data and iterating through it in Python, but I wonder if there's a way to do it in BigQuery.

We have a table with start and end dates:

begin_date, end_date
'2016-02-19', '2016-02-19'
'2016-02-20', '2016-02-25'
'2016-02-21', '2016-02-25'
'2016-02-22', NULL

We want the count of rows for each date where begin_date <= date <= end_date. For any specific value, it's easy enough to select the count:

SELECT COUNT(*) FROM `table` WHERE begin_date <= '2016-12-19' AND (end_date >= '2016-12-19' OR end_date IS NULL)

So if I manually do this for each value I'm interested in, the desired output could look like this:

begin_date, count
2016-02-19, 1
2016-02-20, 1
2016-02-21, 2
2016-02-22, 3
2016-02-23, 3
2016-02-24, 3
2016-02-25, 3
2016-02-26, 1
etc.

It's easy enough to create the list of dates to iterate across:

WITH dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example)

Now I am struggling with applying the above WHERE clause across all these dates. I see how a partition with a range works when matching against a single column (like here), but I need to match against both begin_date and end_date.

I thought I could do it with this:

SELECT
  status_begin_date,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE (e >= status_begin_date OR e IS NULL)) AS cnt
FROM (
  SELECT
    status_begin_date,
    ARRAY_AGG(status_end_date) OVER(ORDER BY status_begin_date) AS ends
  FROM `table`
)
ORDER BY status_begin_date

Taken from here. This works on the small example given in the StackOverflow answer, but I get a resource error using it on my table with several hundred million rows: enter image description here Is there a scalable solution in BigQuery?

2
Can you try this code? SELECT begin_date, COUNT(*) FROM 'table' CROSS JOIN dates WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL) GROUP BY begin_date ORDER BY begin_date - rmesteves
Please let me know if this is what you are searching for - rmesteves
@rmesteves Thanks, but this doesn't give the same results. I'm not exactly sure what the difference is. Sometimes they values are higher than expected and sometimes they're lower. - Xander Dunn
@rmesteves You can see the difference in results with this: sql WITH data AS ( SELECT DATE('2016-02-19') AS begin_date, DATE('2016-02-19') AS end_date UNION ALL SELECT '2016-02-20', '2016-02-25' UNION ALL SELECT '2016-02-21', '2016-02-25' UNION ALL SELECT '2016-02-22', NULL ), dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2016-02-19', '2016-02-26', INTERVAL 1 DAY)) AS example) SELECT begin_date, COUNT(*) FROM data CROSS JOIN dates WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL) GROUP BY begin_date ORDER BY begin_date - Xander Dunn
Can you try my code with this small change? Maybe the results can be optimized: SELECT example, COUNT(*) FROM 'table' CROSS JOIN dates WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL) GROUP BY example ORDER BY example - rmesteves

2 Answers

2
votes

Below is for BigQuery Standard SQL and does not use inefficient cursor approach and rather using classic sql set based one

#standardSQL
WITH dates AS (
  SELECT day 
  FROM (SELECT MIN(begin_date) min_date, MAX(end_date) max_date FROM `table`), 
  UNNEST(GENERATE_DATE_ARRAY(min_date, CURRENT_DATE(), INTERVAL 1 DAY)) AS day
)
SELECT day, COUNT(*) 
FROM dates 
JOIN `table` 
ON begin_date <= day AND (end_date >= day OR end_date IS NULL)
GROUP BY day

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `table` AS (
  SELECT DATE '2016-02-19' begin_date, DATE '2016-02-19' end_date UNION ALL
  SELECT '2016-02-20', '2016-02-25' UNION ALL
  SELECT '2016-02-21', '2016-02-25' UNION ALL
  SELECT '2016-02-22', NULL
), dates AS (
  SELECT day 
  FROM (SELECT MIN(begin_date) min_date, MAX(end_date) max_date FROM `table`), 
  UNNEST(GENERATE_DATE_ARRAY(min_date, max_date, INTERVAL 1 DAY)) AS day
)
SELECT day, COUNT(*) 
FROM dates 
JOIN `table` 
ON begin_date <= day AND (end_date >= day OR end_date IS NULL)
GROUP BY day
-- ORDER BY day  

with result

Row day         f0_  
1   2016-02-19  1    
2   2016-02-20  1    
3   2016-02-21  2    
4   2016-02-22  3    
5   2016-02-23  3    
6   2016-02-24  3    
7   2016-02-25  3    
0
votes

This nasty code worked:

DECLARE dates ARRAY <DATE>;
DECLARE x INT64 DEFAULT 0;
DECLARE results ARRAY <INT64>;
DECLARE results_dates ARRAY <DATE>;
DECLARE result INT64;
DECLARE date DATE;
SET dates = GENERATE_DATE_ARRAY('2016-02-17', '2019-05-13', INTERVAL 1 DAY);
LOOP
  SET date = dates[OFFSET(x)];
  SET result = (SELECT COUNT(*) FROM `table` WHERE begin_date <= date AND (end_date >= date OR end_date IS NULL));
  SET results = ARRAY_CONCAT(results, [result]);
  SET results_dates = ARRAY_CONCAT(results_dates, [date]);
  SET x = x + 1;
  IF x >= ARRAY_LENGTH(dates) THEN
    LEAVE;
  END IF;
END LOOP;
SELECT date, count_subscribers
FROM UNNEST(results_dates) AS date WITH OFFSET 
JOIN UNNEST(results) AS count_subscribers WITH OFFSET
USING(OFFSET)

Runtime of 1.5 hours, which is better than my Python code (7 hours), but the BigQuery code isn't parallelizable, whereas the Python code is.