We have been trying quite hard to loop over data in (standard sql) BigQuery to no success.
I am not sure if it is the supported functionality of sql, our undestanding of the problem or the way we want to do this as we want to do it within BigQuery.
Anyhow, let's say we have a table of events where each event is described by a user id and a date (there can be many events on the same date by the same user id)
id STRING
dt DATE
One thing we want to know is how many distinct users generated events within a given period of time. This is rather trivial, just a COUNT on the table with the period as constraint in the WHERE clause. For example, if we have four months as our period of time:
SELECT
COUNT(DISTINCT id) AS total
FROM
`events`
WHERE
dt BETWEEN DATE_ADD(CURRENT_DATE(), INTERVAL -4 MONTH)
AND CURRENT_DATE()
However, our issues come if we want the history as well for other days (or weeks) recursively with the same given period of time. For example, for yesterday, the day before yesterday, etc... till... for example, 3 months ago. So the variable here would be CURRENT_DATE() that goes back by one day or whichever factor but the interval remains the same (in our case, 4 months). We are expecting something like this (with a factor of one day):
2017-07-14 2017-03-14 1760333
2017-07-13 2017-03-13 1856333
2017-07-12 2017-03-12 2031993
...
2017-04-14 2017-01-14 1999352
This is just a loop over every day, week, etc on the same table, and then a COUNT on the distinct events happening within that period of time. But we can't do 'loops' in BigQuery.
One way we thought was a JOIN, and then a COUNT on the GROUP BY intervals (taking advantage of the HAVING clause to simulate the period from a given day back to 4 months), but this is very inefficient and it just doesn't ever finish considering table's size (which has around 254 million records, 173 GB as of today, and it just keeps growing every day).
Another way we thought was using UDFs with the idea that we feed a list of date intervals to the function and then we function would apply the naive query (for counting) for every interval returning the interval and the count for that interval. But... UDFs in BigQuery do not support accessing tables within the UDF so we would have to sort of feed the whole table to the UDF which we haven't tried but doesn't seem reasonable.
So, we have no solution in mind to basically iterate over the same data and do calculations on parts of the data (overlapping parts as you see) within BigQuery and our only solution is doing this outside BigQuery (the loop functionality in the end).
Is there a way or someone can think of a way to do this all within BigQuery? Our goal would be to provide this as a view inside BigQuery so that it doesn't depend on an external system that needs to be triggered at the frequency that we set up (days/weeks/etc...).
OVER (PARTITION)
andRANGE
cloud.google.com/bigquery/docs/reference/standard-sql/…. To get started -> stackoverflow.com/questions/29899097/… – Graham PolleyGENERATE_DATE_ARRAY
should let you create a "loop" over the dates. I'll try to add an answer later unless someone else does first. – Elliott BrossardGENERATE_DATE_ARRAY
- never knew it existed! – Graham PolleyGENERATE_DATE_ARRAY
seems possible but after trying the suggestion from @Will below, we get the error I point as a comment (and looking for that error suggests another approach to solve the problem). – Guille