BigQuery Cluster field usage/value not clear

Question

I created a table with a cluster filed but I don't see any saving or any performance improvement, this is what I have done:

I created a destination table with 3 columns: projectId, tableId and schema using this SQL:

SELECT projectId, tableId, schema 
FROM `project.dataset.tables` 
WHERE _partitionTime >= '2018-12-27 00:00:00'

Partition Field: Default partitionTime Cluster Field: projectId, tableId

The original cost of this sql is: $2.82

Now When selecting from the new table I expect

To get lower cost
To get better performance

I'm using this SQL

SELECT * FROM `project.table.testCluster` 
WHERE  projectId = 'xxx' and tableId = 'yyy' 
AND _PARTITIONTIME >= TIMESTAMP("2018-12-30") LIMIT 1000

From my benchmark and from BigQuery console execution report I see neither

Any ideas why?

Your screenshot says 578GB query was reduced to 346 MB processed. I think you just missed that. — Pentium10
Looks Like dryRun API and Run API returns different bytes process 620558572756 Vs 363303416 maybe this is where my problem is coming from. However Still not sure about what I see — Tamir Klein
This is by design like this. DryRun cannot return an optimized query costs, and will return the full scan numbers. Always trust the stats after the query has been executed. — Pentium10
Thanks, do you know what's the meaning of the comma separate in the cluster field? — Tamir Klein
You can have 5 dimensions for clustering (eg: continents, countries, regions, countyies, cities), for practical examples read this blog post: medium.com/google-cloud/… — Pentium10

Asif Al Hye Asif Al Hye · Accepted Answer · 2020-06-24T19:38:59

BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks. When you submit a query that contains a filter on a clustered column, BigQuery uses the clustering information to efficiently determine whether a block contains any data relevant to the query.

This allows BigQuery to only scan the relevant blocks — a process referred to as block pruning.

One small catch here. BigQuery provides an estimate for how much data each query will query before running the query. Without clustering, said estimate is exact. With clustering the estimate is an upper bound, and the query might end up querying less or may remain the same. It depends on the structure of the clustered column. The higher the unique values in the clustered column, lower the optimization.

BigQuery Cluster field usage/value not clear

1 Answers