1
votes

I have a file that has a record of all usage of our product, which includes user id, number of calls made and the date the calls were made (it's rolled up to the date, by user id).

user_id | num_calls | date

123 | 32 | 2018-04-17

435 | 21 | 2018-04-17

123 | 35 | 2018-04-18

435 | 10 | 2018-04-18

123 | 20 | 2018-04-19

435 | 90 | 2018-04-20

I want to produce a chart that shows, for each day in the past and going forward, users, who were active in the 30 days prior to and including that date, and how many calls they made over that 30 day period. Ultimately, I will be using this to set various thresholds for "high usage" in a given 30 day period. It would look like this:

user_id | num_calls_in_previous_30_days | date

123 | 32 | 2018-04-17

435 | 21 | 2018-04-17

123 | 67 | 2018-04-18

435 | 31 | 2018-04-18

123 | 87 | 2018-04-19

435 | 31 | 2018-04-19

123 | 87 | 2018-04-20

435 | 121 | 2018-04-20

The issue I'm having is that when I try to use the window function

sum(num_calls) over (partition by id ORDER BY UNIX_SECONDS(timestamp(date)) range BETWEEN 2505600 PRECEDING AND CURRENT ROW)

I only get the total number of calls in the last 30 days for users who were active on each specific date as opposed to including all users who were active in the 30 days prior to that date and their usage over that time frame. Using the same data from above, it looks like this:

user_id | num_calls_in_previous_30_days | date

123 | 32 | 2018-04-17

435 | 21 | 2018-04-17

123 | 67 | 2018-04-18

435 | 31 | 2018-04-18

123 | 87 | 2018-04-19

435 | 121 | 2018-04-20

I tried another route, which was getting all unique user_ids from the previous 30 days from each date, but I wasn't sure how to join this with my existing usage data to get my desired result.

I'm sure there's a simple solution here, but I've spent a few hours on it and can't seem to wrap my head around how to solve this.

Thanks in advance!

1

1 Answers

0
votes

Below example is for BigQuery Standard SQL

#standardSQL
WITH dates AS (
  SELECT DATE
  FROM (
    SELECT MIN(DATE) start, MAX(DATE) finish
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(start, finish)) DATE
), users AS (
  SELECT DISTINCT user_id 
  FROM `project.dataset.table`
)
SELECT user_id, num_calls, DATE,
  SUM(num_calls) OVER (win30days) num_calls_in_previous_30_days
FROM users
CROSS JOIN dates
LEFT JOIN `project.dataset.table` USING(DATE, user_id)
WINDOW win30days AS (
  PARTITION BY user_id 
  ORDER BY UNIX_SECONDS(TIMESTAMP(DATE)) 
  RANGE BETWEEN 2505600 PRECEDING AND CURRENT ROW
)

You can test, play with above using dummy data from your question as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 123 user_id, 32 num_calls, DATE '2018-04-17' DATE UNION ALL
  SELECT 435, 21, '2018-04-17' UNION ALL
  SELECT 123, 35, '2018-04-18' UNION ALL
  SELECT 435, 10, '2018-04-18' UNION ALL
  SELECT 123, 20, '2018-04-19' UNION ALL
  SELECT 435, 90, '2018-04-20' 
), dates AS (
  SELECT DATE
  FROM (
    SELECT MIN(DATE) start, MAX(DATE) finish
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(start, finish)) DATE
), users AS (
  SELECT DISTINCT user_id 
  FROM `project.dataset.table`
)
SELECT user_id, num_calls, DATE,
  SUM(num_calls) OVER (win30days) num_calls_in_previous_30_days
FROM users
CROSS JOIN dates
LEFT JOIN `project.dataset.table` USING(DATE, user_id)
WINDOW win30days AS (
  PARTITION BY user_id 
  ORDER BY UNIX_SECONDS(TIMESTAMP(DATE)) 
  RANGE BETWEEN 2505600 PRECEDING AND CURRENT ROW
)
-- ORDER BY DATE, user_id

with result as

Row user_id num_calls   DATE            num_calls_in_previous_30_days    
1   123     32          2018-04-17      32   
2   435     21          2018-04-17      21   
3   123     35          2018-04-18      67   
4   435     10          2018-04-18      31   
5   123     20          2018-04-19      87   
6   435     null        2018-04-19      31   
7   123     null        2018-04-20      87   
8   435     90          2018-04-20      121