This is expected behavior for clustered tables - once the query runner has opened enough clusters to produce the desired results, it will stop scanning and charge only for the bytes scanned so far.
For example:
SELECT * WHERE x=1 LIMIT 1
will stop after it finds the first one, if x is not clustered, and will do a full table scan if there's no one.
Actual queries over the same table:
SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%hoffa%'
LIMIT 1
Query complete (2.1s elapsed, 111 MB processed)
Query complete (1.8s elapsed, 126 MB processed)
Query complete (1.9s elapsed, 114 MB processed)
SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%khoffa%'
LIMIT 1
Query complete (2.9s elapsed, 2.52 GB processed)
Query complete (3.0s elapsed, 1.19 GB processed)
Query complete (1.9s elapsed, 114 MB processed)
SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%fhoffa%'
LIMIT 1
Query complete (5.1s elapsed, 188 GB processed)
Query complete (5.1s elapsed, 188 GB processed)
Query complete (5.1s elapsed, 188 GB processed)
'%hoffa%'
scans less data because there is a high chance of finding it on the first cluster opened.
'%khoffa%'
is harder to find, hence sometimes many clusters need to be opend - but you might get lucky and find it on the first one.
'%fhoffa%'
is not there, so BQ has to open every cluster in case it was.
- I added
RAND()
to ensure no caching.
- Big win: Previously BigQuery used to charge 188 GB for queries like this, but now it can charge 0.05% of that in cases like this.
Note that if you force BigQuery to scan every cluster, then it's back to 188GB. For example if you want to find the top %hoffa%
(instead of just the first one):
SELECT *, RAND()
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE datehour >= "2017-12-01"
AND title LIKE '%hoffa%'
ORDER BY views DESC
LIMIT 1
Query complete (5.5s elapsed, 188 GB processed)
Read more about the advantages of clustered tables: