1
votes

I'm trying to look at user activity by date. The first step is to build a table of every day since a user account was created, using cross join and a where clause. My first attempt was this:

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,

from rsdw.user u
  cross join (select date(dt) as date from [rsdw.calendar] where date(dt) < CURRENT_DATE() ) cal
where
  date(u.created) <= cal.date

(The calendar table is just a list of all dates since 2006 (3288 rows). The user table has ~1m rows.)

This query takes a forever... so long that I've abandoned it at 1000 seconds or so. I tried tweaking the query a little. If I add an "each" to the cross join:

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,

from rsdw.user u
  cross join each (select date(dt) as date from [rsdw.calendar] where date(dt) < CURRENT_DATE() ) cal
where
  date(u.created) <= cal.date

I get an errror:

Error: Cannot CROSS JOIN two tables with EACH qualifiers.

Finally, if I keep the "each" but swap the tables, it completes in just 90s!

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,

from (select date(dt) as date from [rsdw.calendar] where date(dt) < CURRENT_DATE() ) cal
  cross join each rsdw.user u
where
  date(u.created) <= cal.date

Can anyone explain why the third iteration is so much faster and why the second one results in an error?

2

2 Answers

5
votes

You're hitting some corner-case behavior with how joins interact with sub-select clauses. Joins against named tables benefit from some optimization based on table size, while sub-selects are unpredictable and can result in bad performance. I've filed an internal bug for us to improve this case.

In the first, slow case, your tiny sub-select of dates is copied and broadcast to a small number of machines each handling a large number of users. It takes forever as there is very little parallelism.

The second case is a query parsing error for internal reasons, basically it attempts to have machines handle both a small range of dates and a small range of users, which will not complete a cross join.

In the third, fast case, your tiny sub-select of dates is copied and broadcast to many machines, each handling a small range of users. It completes very fast due to the large parallelism.

Once we complete the bug I've filed, the third case behavior will hopefully happen automatically.

0
votes

Based on https://cloud.google.com/bigquery/query-reference, cross joins do not even support the "each" clause, so I'm somewhat surprised that the third query even runs.

Cross joins form a Cartesian product (each record from the left table is joined with each record from the right query) - it is a very inefficient way of joining data (though at times cannot be avoided, such as the way you are trying to create your dataset). How many records do you have in the Users table? The EACH clause should only work when the left table is the smaller one (and under 8MB compressed). Your calendar table contains 3650 records, so if you have more users than that, it needs to be on the left side - like you did in the third query.