2
votes

I'm trying to pull user activity by date. I am trying to built a table of every day since a user account was created, using cross join and a where clause. In my case, cross join cannot be avoided. The calendar table is just a list of all dates for last 365 days (365 rows). The user table has ~1b rows.

Here is the query that fails with insufficient resources:

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,
from (select date(dt) as date from [dw.calendar] where date(dt) < 
CURRENT_DATE() ) cal
  cross join each dw.user u
where
  date(u.created) <= cal.date

Based on https://cloud.google.com/bigquery/query-reference, cross joins do not even support the "each" clause. How do I perform the above operation to successfully create a table?

1
Exact Error: Query Failed Error: Query exceeded resource limits for tier 1. Tier 6 or higher required.axrd
This will probably work better using standard SQL, though even in that case I suspect that you will need to enable a higher billing tier due to how many rows result from the cross join.Elliott Brossard
please explain - why you need it in first place? this just does nothing but explodes number of the rows w/o any value delivered! most likely you plan to treat it as an intermediate/temp table to do something with it than - so I would recommend rather to focus on final goal and not materialize that cross product stuff. unless - of course - you have good reason, so explain itMikhail Berlyant
I need to find out by aggregating user activity by day, week, month, year. I will need to calculate daily count and perform window function to get the aggregated sum. For each day I need a value, 0 if its missing. SELECT id, daily_count, SUM(daily_count) OVER (PARTITION BY id ORDER BY d DESC rows BETWEEN current row AND 6 following ) weekly_avg from (select date(dt) as date from [dw.calendar] where date(dt) < CURRENT_DATE() ) cal cross join each dw.user u where date(u.created) <= cal.dateaxrd
so, why not to update your question with your real target/goal? I think it makes more sense and will attract more answers than current abstract and intermediate step :o)Mikhail Berlyant

1 Answers

4
votes

You do not need to fill "empty" days to just calculate daily count and perform window function to get the aggregated sum, so you don't even need calendar table for this. To make this happen you need to use RANGE vs. ROWS in your window. See example below (for BigQuery Standard SQL)

#standardSQL
SELECT 
  user_id, created, daily_count,
  SUM(daily_count) OVER(
    PARTITION BY user_id ORDER BY created_unix_date DESC 
    RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING 
  ) weekly_avg
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC

i am not sure about exact schema /types of your table so might need to adjust above respectively, but meantime you can test/play with below dummy data

#standardSQL
WITH `dw.user` AS (
  SELECT 
    day AS created,
    CAST(1 + 10 * RAND() AS INT64)  AS user_id,
    CAST(100 * RAND() AS INT64) AS daily_count
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-26')) AS day
)
SELECT 
  user_id, created, daily_count,
  SUM(daily_count) OVER(
    PARTITION BY user_id ORDER BY created_unix_date DESC 
    RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING 
  ) weekly_avg
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC