BigQuery Cross Join Much Faster with Each

Question

I'm trying to look at user activity by date. The first step is to build a table of every day since a user account was created, using cross join and a where clause. My first attempt was this:

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,

from rsdw.user u
  cross join (select date(dt) as date from [rsdw.calendar] where date(dt) < CURRENT_DATE() ) cal
where
  date(u.created) <= cal.date

(The calendar table is just a list of all dates since 2006 (3288 rows). The user table has ~1m rows.)

This query takes a forever... so long that I've abandoned it at 1000 seconds or so. I tried tweaking the query a little. If I add an "each" to the cross join:

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,

from rsdw.user u
  cross join each (select date(dt) as date from [rsdw.calendar] where date(dt) < CURRENT_DATE() ) cal
where
  date(u.created) <= cal.date

I get an errror:

Error: Cannot CROSS JOIN two tables with EACH qualifiers.

Finally, if I keep the "each" but swap the tables, it completes in just 90s!

SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date,

from (select date(dt) as date from [rsdw.calendar] where date(dt) < CURRENT_DATE() ) cal
  cross join each rsdw.user u
where
  date(u.created) <= cal.date

Can anyone explain why the third iteration is so much faster and why the second one results in an error?

Matthew Wesley Matthew Wesley · Accepted Answer · 2015-01-05T18:46:51

You're hitting some corner-case behavior with how joins interact with sub-select clauses. Joins against named tables benefit from some optimization based on table size, while sub-selects are unpredictable and can result in bad performance. I've filed an internal bug for us to improve this case.

In the first, slow case, your tiny sub-select of dates is copied and broadcast to a small number of machines each handling a large number of users. It takes forever as there is very little parallelism.

The second case is a query parsing error for internal reasons, basically it attempts to have machines handle both a small range of dates and a small range of users, which will not complete a cross join.

In the third, fast case, your tiny sub-select of dates is copied and broadcast to many machines, each handling a small range of users. It completes very fast due to the large parallelism.

Once we complete the bug I've filed, the third case behavior will hopefully happen automatically.

BigQuery Cross Join Much Faster with Each

2 Answers