Optimize Active Accounts Query with Start / End Date on Google BigQuery

Question

I'm looking to optimize a query to get active accounts per interval (day, week, etc) where the active is defined on a start and end date for the account. Layout is as below:

accounts: account_id Int subscription_start Timestamp subscription_end Timestamp -- null if still active

I tried several approaches, the fastest one being using generate_array to create arrays for active days on the fly and then unnesting them (about 2 minutes)

with nested_dates as (
  select GENERATE_DATE_ARRAY(DATE(subscription_start), IFNULL(date(subscription_end), current_date()), INTERVAL 1 DAY) as dates
  from `accounts`
),
all_dates as (
  select date_item from nested_dates, UNNEST(dates) as date_item
)
select date_item, count(1) from all_dates group by date_item

I also used a subselect per day, which used to perform quite well in similar situations 6-9 months ago when BigQuery still had the performance tiers and gave more lee way with CPU. But it seems to be much stricter / less efficient now as they removed the higher computing tier pricing. This execution takes about 12 minutes.

select
  day,
  (select 
     count(1) 
   from `accounts` where
     subscription_start <= day and 
     (subscription_end is null or subscription_end >= day)
  )
from unnest(
    generate_date_array(date('2015-06-01'), current_date(), interval 1 day)
) AS day

I also did performance tests on cross join of accounts with generates dates and a conditional sum (fails after 3300 seconds because of CPU limit exceeded).

What works very fast is of course when I generate a materialized table of all days for each account where it has been active, then queries finish within 6-10 seconds, if I'd implement partitioning probably even faster (e.g. by using the with from first query "all_dates".

select date(day_active), count(1) from `account_all_dates`

Now the question: Is there a way to get the performance of the materialized cross join without actually blowing up the data first and materializing it, i.e. in real time without the overhead.

I was trying with some analytic function, but could not find something that could do this.

Mikhail Berlyant Mikhail Berlyant · Accepted Answer · 2018-03-17T15:30:18

I think your first query is just perfect and is a way to go It is already most generic and works for any interval (day, week, etc) and off of all non-materialized options - the fastest one

The only potential improvement (and also it a little more compact) is below (in my quick testing with dummy data it constantly showed slightly faster result - but I cannot say it is for sure - would be interesting to see result on your real data)

#standardSQL
SELECT date_item, COUNT(1) active_accounts 
FROM `accounts`, 
UNNEST(GENERATE_DATE_ARRAY(DATE(subscription_start), 
  IFNULL(DATE(subscription_end), CURRENT_DATE()), INTERVAL 1 DAY)) date_item
GROUP BY date_item

Optimize Active Accounts Query with Start / End Date on Google BigQuery

1 Answers