I have a Google BigQuery table of 500,000 rows that I have setup to be partitioned by a TIMESTAMP field called Date and clustered by a STRING field called EventCategory (this is just a sample of a table that is over 500 million rows).
I have a duplicate of the table that is not partitioned and not clustered.
I run the following query on both tables:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "email"
There are only 2400 rows where EventCategory is "email". When I run the query on the non clustered table I get the following:
When I run the query on the clustered table I get the following:
Here is the schema of both the non clustered and the clustered table:
Date TIMESTAMP NULLABLE
UserId STRING NULLABLE
EventCategory STRING NULLABLE
EventAction STRING NULLABLE
EventLabel STRING NULLABLE
EventValue STRING NULLABLE
There is basically no difference between the two queries and how much data they look through and I can't seem to figure out why? I have confirmed that the clustered table is partitioned and clustered because in the BigQuery UI in table details it actually says so and running a query by filtering by Date greatly reduces the size of the data searched and shows the estimated query size to be much smaller.
Any help here would be greatly appreciated!
UPDATE:
If I change the query to:
SELECT
*
FROM
`table_name`
WHERE
EventCategory = "ad"
I get the following result:
There are 53640 rows with EventCategory is "ad" and it looks like clustering did result in less table data being scanned, albeit not much less (529.2MB compared to 586MB).
So it looks like clustering is working but the data is not clustered properly in the table? How would I fix that? I have tried re-creating the table multiple times using DDL and even saving the table data to a JSON in GCS and then importing it into a new partitioned and clustered table but it hasn't changed anything.
Does the date partitioning sit on top of the clustering? Meaning that BigQuery first groups by date and then groups by cluster within those date groups? If so, I think that would probably explain it but it would render clustering not very useful.