0 votes

My recent experience suggests that a BigQuery query with a join performs much more slowly than one against a single table. Most of my queries normally finish in no more than a few minutes, but the following join query takes more than 6000 seconds, with table1 having ~1 billion rows and table2 ~1 million rows.

SELECT a.*, COALESCE(b.field4, a.field3)
FROM `table1` a
LEFT OUTER JOIN `table2` b
  ON a.field1 = b.field1 AND a.field2 = b.field2

I just want to ask the community: is this expected, or does it need further investigation?

Did you check the query plan using the "Explanation" tab? It shows where the time went. - Elliott Brossard
Yes, actually the bulk of the time went to the Compute part of the last stage (Stage 5): READ $80, $81, $82, $83, $84, ... FROM __SHUFFLE0 WRITE $80, $81, $82, $83, $84, ... TO __output - bignano
How many rows are being written as output? Can you maybe include a screenshot of the slow stage (or all of them) from the explanation in your question? - Elliott Brossard
Sure, just added it to my post. - bignano
It does seem like the query shouldn't take as long to read/write a billion rows. You could try submitting a bug report to the issue tracker with a sample job ID to have someone from the BigQuery team take a look. - Elliott Brossard

1 Answer

0 votes

The solution is to specify a destination table for the query results. This brings the total time down to less than 200 seconds, presumably because the slow final stage observed in the plan was spent writing the large result set to the anonymous __output destination.
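The destination table can be set in the web UI's query options or on the job configuration when submitting through the API. Equivalently, in standard SQL the result can be materialized directly with CREATE TABLE ... AS SELECT. A minimal sketch of the latter, where the dataset name mydataset and the column alias field4_filled are hypothetical (CREATE TABLE AS requires every output column to be named, so the COALESCE expression needs an alias):

-- Write the join result to a permanent table instead of the
-- default anonymous result table.
CREATE TABLE `mydataset.join_result` AS
SELECT a.*, COALESCE(b.field4, a.field3) AS field4_filled
FROM `table1` a
LEFT OUTER JOIN `table2` b
  ON a.field1 = b.field1 AND a.field2 = b.field2

The query itself is unchanged; only where the billion-row output lands is different.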