4
votes

We have been trying quite hard to loop over data in (standard sql) BigQuery to no success.

I am not sure if it is the supported functionality of sql, our undestanding of the problem or the way we want to do this as we want to do it within BigQuery.

Anyhow, let's say we have a table of events where each event is described by a user id and a date (there can be many events on the same date by the same user id)

id  STRING
dt  DATE

One thing we want to know is how many distinct users generated events within a given period of time. This is rather trivial, just a COUNT on the table with the period as constraint in the WHERE clause. For example, if we have four months as our period of time:

SELECT
  COUNT(DISTINCT id) AS total
FROM
  `events`
WHERE
  dt BETWEEN DATE_ADD(CURRENT_DATE(), INTERVAL -4 MONTH)
  AND CURRENT_DATE()

However, our issues come if we want the history as well for other days (or weeks) recursively with the same given period of time. For example, for yesterday, the day before yesterday, etc... till... for example, 3 months ago. So the variable here would be CURRENT_DATE() that goes back by one day or whichever factor but the interval remains the same (in our case, 4 months). We are expecting something like this (with a factor of one day):

2017-07-14 2017-03-14 1760333
2017-07-13 2017-03-13 1856333
2017-07-12 2017-03-12 2031993
...
2017-04-14 2017-01-14 1999352

This is just a loop over every day, week, etc on the same table, and then a COUNT on the distinct events happening within that period of time. But we can't do 'loops' in BigQuery.

One way we thought was a JOIN, and then a COUNT on the GROUP BY intervals (taking advantage of the HAVING clause to simulate the period from a given day back to 4 months), but this is very inefficient and it just doesn't ever finish considering table's size (which has around 254 million records, 173 GB as of today, and it just keeps growing every day).

Another way we thought was using UDFs with the idea that we feed a list of date intervals to the function and then we function would apply the naive query (for counting) for every interval returning the interval and the count for that interval. But... UDFs in BigQuery do not support accessing tables within the UDF so we would have to sort of feed the whole table to the UDF which we haven't tried but doesn't seem reasonable.

So, we have no solution in mind to basically iterate over the same data and do calculations on parts of the data (overlapping parts as you see) within BigQuery and our only solution is doing this outside BigQuery (the loop functionality in the end).

Is there a way or someone can think of a way to do this all within BigQuery? Our goal would be to provide this as a view inside BigQuery so that it doesn't depend on an external system that needs to be triggered at the frequency that we set up (days/weeks/etc...).

2
SQL doesn't have loops. You should be able to use analytic functions i.e. OVER (PARTITION) and RANGE cloud.google.com/bigquery/docs/reference/standard-sql/…. To get started -> stackoverflow.com/questions/29899097/…Graham Polley
GENERATE_DATE_ARRAY should let you create a "loop" over the dates. I'll try to add an answer later unless someone else does first.Elliott Brossard
@ElliottBrossard - too late. Mikhail beat you to it ;-) And thanks for pointing me to GENERATE_DATE_ARRAY - never knew it existed!Graham Polley
@GrahamPolley, very aware of that of course, we are just trying to emulate it in a table by cross joining with itself pretty much. But it is that cross joining what gives the trouble (performance-wise I suppose) and a clever way of doing is what we haven't figured out nor I have found many examples. As @ElliottBrossard points, GENERATE_DATE_ARRAY seems possible but after trying the suggestion from @Will below, we get the error I point as a comment (and looking for that error suggests another approach to solve the problem).Guille

2 Answers

6
votes

Below is example of this technique for BigQuery Standard SQL

#standardSQL
SELECT 
  DAY,
  COUNT(CASE WHEN period = 7  THEN id END) AS days_07,
  COUNT(CASE WHEN period = 14 THEN id END) AS days_14,
  COUNT(CASE WHEN period = 30 THEN id END) AS days_30
FROM (
  SELECT
    dates.day AS DAY,
    periods.period AS period,
    id
  FROM yourTable AS activity
  CROSS JOIN (SELECT DAY FROM yourTable GROUP BY DAY) AS dates
  CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL 
                SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
  WHERE dates.day >= activity.day 
  AND CAST(DATE_DIFF(dates.day, activity.day, DAY) / periods.period AS INT64) = 0
  GROUP BY 1,2,3
)
GROUP BY DAY
-- ORDER BY DAY 

You can play/test with this example using dummy data as below

#standardSQL
WITH data AS (
  SELECT 
    DAY, CAST(10 * RAND() AS INT64) AS id
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-07-13')) AS DAY
)
SELECT 
  DAY,
  COUNT(DISTINCT CASE WHEN period = 7  THEN id END) AS days_07,
  COUNT(DISTINCT CASE WHEN period = 14 THEN id END) AS days_14,
  COUNT(DISTINCT CASE WHEN period = 30 THEN id END) AS days_30
FROM (
  SELECT
    dates.day AS DAY,
    periods.period AS period,
    id
  FROM data AS activity
  CROSS JOIN (SELECT DAY FROM data GROUP BY DAY) AS dates
  CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL 
                SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
  WHERE dates.day >= activity.day 
  AND CAST(DATE_DIFF(dates.day, activity.day, DAY) / periods.period AS INT64) = 0
  GROUP BY 1,2,3
)
GROUP BY DAY
ORDER BY DAY    
1
votes

Does it work for you?

WITH dates AS(
  SELECT GENERATE_DATE_ARRAY(DATE_SUB(CURRENT_DATE(), INTERVAL 4 MONTH), CURRENT_DATE()) arr_dates
),
data AS(
  SELECT 1 id, '2017-03-14' dt UNION ALL
  SELECT 1 id, '2017-03-14' dt UNION ALL
  SELECT 1, '2017-04-20' UNION ALL
  SELECT 2, '2017-04-20' UNION ALL
  SELECT 3, '2017-03-15' UNION ALL
  SELECT 4, '2017-04-20' UNION ALL
  SELECT 5, '2017-07-14'
)

SELECT
  i_date date,
  DATE_ADD(i_date, INTERVAL 4 MONTH) next_date,
  (SELECT COUNT(DISTINCT id) FROM data WHERE PARSE_DATE("%Y-%m-%d", data.dt) BETWEEN i_date AND DATE_ADD(i_date, INTERVAL 4 MONTH)) total
FROM dates,
UNNEST(arr_dates) i_date
ORDER BY i_date

Where data is a simulation of your events table.