0 votes

My recent experience suggests that a BigQuery query with a join performs much more slowly than one against a single table. Most of my queries normally finish in no more than a few minutes, but the following join query takes more than 6000 seconds, with table1 having ~1 billion rows and table2 ~1 million rows.

SELECT a.*, COALESCE(b.field4, a.field3)
FROM `table1` a
LEFT OUTER JOIN `table2` b
  ON a.field1 = b.field1 AND a.field2 = b.field2

I just want to ask the community: is this expected, or does it need further investigation?

Did you check the query plan using the "Explanation" tab? It shows where the time went. - Elliott Brossard
Yes, actually the bulk of the time went to the Compute part of the last stage (Stage 5): READ $80, $81, $82, $83, $84, ... FROM __SHUFFLE0 WRITE $80, $81, $82, $83, $84, ... TO __output - bignano
How many rows are being written as output? Can you maybe include a screenshot of the slow stage (or all of them) from the explanation in your question? - Elliott Brossard
Sure, just added it to my post. - bignano
It does seem like the query shouldn't take as long to read/write a billion rows. You could try submitting a bug report to the issue tracker with a sample job ID to have someone from the BigQuery team take a look. - Elliott Brossard

1 Answer

0 votes

The solution is to specify a destination table for the query results. This brings the total time down to less than 200 seconds, presumably because the slow final stage observed in the plan was spent writing the large result set to the anonymous __output destination.
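The destination table can be set in the web UI's query options or on the job configuration when submitting through the API. Equivalently, in standard SQL the result can be materialized directly with CREATE TABLE ... AS SELECT. A minimal sketch of the latter, where the dataset name mydataset and the column alias field4_filled are hypothetical (CREATE TABLE AS requires every output column to be named, so the COALESCE expression needs an alias):

-- Write the join result to a permanent table instead of the
-- default anonymous result table.
CREATE TABLE `mydataset.join_result` AS
SELECT a.*, COALESCE(b.field4, a.field3) AS field4_filled
FROM `table1` a
LEFT OUTER JOIN `table2` b
  ON a.field1 = b.field1 AND a.field2 = b.field2

The query itself is unchanged; only where the billion-row output lands is different.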