0
votes

I would like to run the following query in BigQuery, ideally as efficiently as possible. The idea is that I have all of these rows corresponding to tests (taken daily) by millions of users and I want to determine, of the users who have been active for over a year, how much each user has improved.

"Improvement" in this case is the average of the first N subtracted from the last N.

In this example, N is 30. (I've also added in the where cnt >= 100 part because I don't want to consider users who took a test a long time ago and just came back to try once more.)

select user_id,
       avg(score) filter (where seqnum_asc <= 30) as first_n_avg,
       avg(score) filter (where seqnum_desc <= 30) as last_n_avg
from (select *,
             row_number() over (partition by user_id order by created_at) as seqnum_asc,
             row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
             count(*) over (partition by user_id) as cnt
      from tests
     ) t
where cnt >= 100
group by user_id
having max(created_at) >= min(created_at) + interval '1 year';
1

1 Answers

0
votes

Just use conditional aggregation and fix the date functions:

select user_id,
       avg(case when seqnum_asc <= 30 then score end) as first_n_avg,
       avg(case when seqnum_desc <= 30 then score end) as last_n_avg
from (select *,
             row_number() over (partition by user_id order by created_at) as seqnum_asc,
             row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
             count(*) over (partition by user_id) as cnt
      from tests
     ) t
where cnt >= 100
group by user_id
having max(created_at) >= timestamp_add(min(created_at),  interval 1 year);

The function in the having clause could be timetamp_add(), datetime_add(), or date_add(), depending on the type of created_at.