3
votes

I'm a BigQuery and SQL newbie that's continuing to tackle grouping problems. Using Standard SQL in BigQuery, I'd like to group data by X days. Here's a table of data:

event_id |    url    |          timestamp   
-----------------------------------------------------------
   xx         a.html      2016-10-18 15:55:16 UTC
   xx         a.html      2016-10-19 16:68:55 UTC
   xx         a.html      2016-10-25 20:55:57 UTC
   yy         b.html      2016-10-18 15:58:09 UTC
   yy         b.html      2016-10-18 08:32:43 UTC
   zz         a.html      2016-10-20 04:44:22 UTC
   zz         c.html      2016-10-21 02:12:34 UTC

I want to count the number of each event that occurred on each url in intervals of X days, starting from a given date. For example: how could I group this in intervals of 3 days, where my first interval starts on 2016-10-18 00:00:00 UTC? In addition, can I assign the 3rd day of the interval to each row? Example output:

event_id |    url    |  count |        3dayIntervalLabel   
-----------------------------------------------------------
   xx         a.html      2           2016-10-20  --> [18th thru 20th]
   yy         b.html      2           2016-10-20
   zz         a.html      1           2016-10-20 
   zz         c.html      1           2016-10-23  --> [21th thru 23th]
   xx         a.html      1           2016-10-26  --> [24th thru 26th]

I added three annotations to clarify the 3dayIntervalLabel values.

In general, I'm hoping to solve: group by intervals of X days, starting from date Y, and label the intervals using the final date of the each interval.

Please let me know if more clarification is needed.

If you're interested, I've also asked similar questions on StackOverflow (and gotten answers) about grouping this data using a rolling window: initial question and follow-up.

Thanks!

2
Not a complete answer, but you may be able to make use of GENERATE_DATE_ARRAY if you need a sequence of dates against which to join.Elliott Brossard
The answer I accepted does in fact make use of GENERATE_DATE_ARRAY() and a join!The Traveling Coder

2 Answers

3
votes
WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day, DATE_ADD(day, INTERVAL 2 DAY) AS endday
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-10-18', '2016-11-06', INTERVAL 3 DAY)) AS day
)
SELECT 
  event_id, 
  url,  
  SUM(events) AS `count`, 
  c.endday AS `ThreedayIntervalLabel`  
FROM calendar AS c
JOIN dailyAggregations AS a
ON a.day BETWEEN c.day AND c.endday
GROUP BY endday, url, event_id
0
votes

If you have a base date, then something like this:

select floor(date_diff(date(timestamp), date '2016-10-18', day) / 3) as days,
       count(*)
from t
group by days
order by days;