How does BigQuery table works with Partition and Cluster field?

Question

When specifying TIMESTAMP column as partition - The data is saved on the disk by the partition allows each access.

Now, BigQuery allows to also define up to 4 columns which will used as cluster field.

If I get it correctly the partition is like PK and the cluster fields are like indexes.

So this means that the cluster fields has nothing to do with how records are saved on the disk?

See this answer which provides a good answer around this subject. Also this medium post is very good by @Felipe — Tamir Klein
@TamirKlein it doesn't. It talks about clustering fields regardless of partition field. — Luis

Tamir Klein Tamir Klein · Accepted Answer · 2019-04-10T09:52:23

If I get it correctly the partition is like PK

This is not correct, Partition is not used to identify a row in the table rather enable BigQuery to Store each partitioned data in a different segment so when you scan a table by Partition you ONLY scan the specified partitions and thus reduce your scanning cost

cluster fields are like indexes

This is correct cluster fields are used as pointers to records in the table and enable quick/minimal cost access to data regardless to the partition. This means Using cluster fields you can query a table cross partition with minimal cost

I like @Felipe image from his medium post which gives nice visualization on how data is stored.

Note: Partitioning happens on the time of the insert while clustering happens as a background job performed by BigQuery

How does BigQuery table works with Partition and Cluster field?

1 Answers