11
votes

I know there is an AVG function in Big Query and there are window functions to shift previous or next values up or down a row, but is there any function that allows you to average over a specified interval? For instance I'd like to so something like the following:

SELECT
    city
    AVG(temperature) OVER(PARTITION BY city, INTERVAL day,14, ORDER BY day) as rolling_avg_14_days,
    AVG(temperature) OVER(PARTITION BY city, INTERVAL day,30, ORDER BY day) as rolling_avg_30_days,
WHERE
    city IN ("Los Angeles","Chicago","Sun Prairie","Sunnyvale")
    AND year BETWEEN 1900 AND 2013

I'd like to do a rolling average calculation that allows me to specify a range of values to do an aggregation function over, and what value to order by. The average function would take the current day temp and previous 13 days (or previous 29) to calculate and average. Is this possible today? I know I could do something like this if I put 13 LAG/OVER fields in the SELECT statement and then avg the results of all of them, but that is a lot of overhead.

2
I did find a previous post from Jordan about using a JOIN to produce a rolling average. I tried using that logic and it works great, however I'm wondering if there is a window function in development that would make the logic a bit simpler.Steve Scherer

2 Answers

20
votes

I think OVER with RANGE construction of Window Functions is best fit here

Assuming that day field is represented as 'YYYY-MM-DD' format, below query does the rolling averages

SELECT
  city,
  day,
  AVG(temperature) OVER(PARTITION BY city ORDER BY ts 
                RANGE BETWEEN 14*24*3600 PRECEDING AND CURRENT ROW) AS rolling_avg_14_days,
  AVG(temperature) OVER(PARTITION BY city ORDER BY ts 
                RANGE BETWEEN 30*24*3600 PRECEDING AND CURRENT ROW) AS rolling_avg_30_days
FROM (
  SELECT day, city, temperature, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts 
  FROM temperatures
)

You most likely already found this solution long ago, but still wanted to have what i think a better answer (as of today) here for this question

0
votes

A different option with JOIN EACH (this can get too slow as an extremely large amount of data can get generated in the intermediate steps):

SELECT a.SensorId SensorId, a.Timestamp, AVG(b.Data) AS avg_prev_hour_load
FROM (
  SELECT * FROM [io_sensor_data.moscone_io13]
  WHERE SensorId = 'XBee_40670EB0/mic') a
JOIN EACH [io_sensor_data.moscone_io13] b
ON a.SensorId = b.SensorId
WHERE b.Timestamp BETWEEN (a.Timestamp - 36000000) AND a.Timestamp
GROUP BY SensorId, a.Timestamp;

(based on Joe Celko's SQL problems)

For window functions, it could be useful to have one implementing a bigger range, but for now I would automate the generation of the query.