BigQuery clustered tables: Bytes processed inconsistent across query runs

Question

For multiple runs of the same query, the number of bytes processed is variable (results change significantly, from 20GB to 30GB).

I'm running the same query, over the same table, getting the same result, multiple query runs, stats show different amounts of bytes processed by each run.

Is this expected?

Felipe Hoffa Felipe Hoffa · Accepted Answer · 2018-10-05T07:05:59

This is expected behavior for clustered tables - once the query runner has opened enough clusters to produce the desired results, it will stop scanning and charge only for the bytes scanned so far.

For example:

SELECT * WHERE x=1 LIMIT 1

will stop after it finds the first one, if x is not clustered, and will do a full table scan if there's no one.

Actual queries over the same table:

SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%hoffa%'
LIMIT 1

Query complete (2.1s elapsed, 111 MB processed)
Query complete (1.8s elapsed, 126 MB processed)
Query complete (1.9s elapsed, 114 MB processed)

SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%khoffa%'
LIMIT 1

Query complete (2.9s elapsed, 2.52 GB processed)
Query complete (3.0s elapsed, 1.19 GB processed)
Query complete (1.9s elapsed, 114 MB processed)


SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%fhoffa%'
LIMIT 1

Query complete (5.1s elapsed, 188 GB processed)
Query complete (5.1s elapsed, 188 GB processed)
Query complete (5.1s elapsed, 188 GB processed)

'%hoffa%' scans less data because there is a high chance of finding it on the first cluster opened.
'%khoffa%' is harder to find, hence sometimes many clusters need to be opend - but you might get lucky and find it on the first one.
'%fhoffa%' is not there, so BQ has to open every cluster in case it was.
I added RAND() to ensure no caching.
Big win: Previously BigQuery used to charge 188 GB for queries like this, but now it can charge 0.05% of that in cases like this.

Note that if you force BigQuery to scan every cluster, then it's back to 188GB. For example if you want to find the top %hoffa% (instead of just the first one):

SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%hoffa%'
ORDER BY views DESC
LIMIT 1

Query complete (5.5s elapsed, 188 GB processed)

Read more about the advantages of clustered tables:

https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b

BigQuery clustered tables: Bytes processed inconsistent across query runs

1 Answers