1
votes

I have a Google BigQuery table of 500,000 rows that I have setup to be partitioned by a TIMESTAMP field called Date and clustered by a STRING field called EventCategory (this is just a sample of a table that is over 500 million rows).

I have a duplicate of the table that is not partitioned and not clustered.

I run the following query on both tables:

SELECT 
  * 
FROM 
  `table_name`   
WHERE
   EventCategory = "email"

There are only 2400 rows where EventCategory is "email". When I run the query on the non clustered table I get the following:

enter image description here

When I run the query on the clustered table I get the following:

enter image description here

Here is the schema of both the non clustered and the clustered table:

Date             TIMESTAMP   NULLABLE
UserId           STRING      NULLABLE
EventCategory    STRING      NULLABLE
EventAction      STRING      NULLABLE
EventLabel       STRING      NULLABLE
EventValue       STRING      NULLABLE

There is basically no difference between the two queries and how much data they look through and I can't seem to figure out why? I have confirmed that the clustered table is partitioned and clustered because in the BigQuery UI in table details it actually says so and running a query by filtering by Date greatly reduces the size of the data searched and shows the estimated query size to be much smaller.

Any help here would be greatly appreciated!

UPDATE:

If I change the query to:

SELECT
  *
FROM
  `table_name`   
WHERE
  EventCategory = "ad"

I get the following result:

enter image description here

There are 53640 rows with EventCategory is "ad" and it looks like clustering did result in less table data being scanned, albeit not much less (529.2MB compared to 586MB).

So it looks like clustering is working but the data is not clustered properly in the table? How would I fix that? I have tried re-creating the table multiple times using DDL and even saving the table data to a JSON in GCS and then importing it into a new partitioned and clustered table but it hasn't changed anything.

Does the date partitioning sit on top of the clustering? Meaning that BigQuery first groups by date and then groups by cluster within those date groups? If so, I think that would probably explain it but it would render clustering not very useful.

3
could you show us the schema for your table? Also it could be that all your data is in the same cluster and the clustering is not working out.Pentium10
@Pentium10 I added the table schema into the question above. There are 5 different possible values in this table data of 500K rows for EventCategory. The one I am selecting for ("email") is the smallest with just 2800 rows where EventCategory is "email".mfratz
How "fragmented" this table? try to recreate the table with one-time select * into the same table and then run the againMikhail Berlyant
@MikhailBerlyant I did as you suggested but same result.mfratz
my last comment is to reply on y our update :o) also I would note - that clustering will help with really large volume of data - which is not exactly the case in your example (i think so) also if you are frequently updates your table - data gets 'fragmented' (from colocation point of view) and thus clustering is less effective - and requires defragmenting by (for example) select * into the same tableMikhail Berlyant

3 Answers

4
votes

If you have less than 100MB of data per day, clustering won't do much for you - you'll probably get one <=100MB cluster of data for each day.

You haven't mentioned how many days of data you have (# of partitions, as Mikhail asked), but since the total data scanned is 500MB, I'll guess that you have at least 5 days of data, and less than 100MB per day.

Hence the results you are getting seem to be the expected results.

See an example of this at work here:

1
votes

The reason clustering wasn't helping very much was specific to the table data. The table was event based data that was partitioned by day and then clustered by EventCategory (data is clustered on each day's partition). Since every day would have a large amount of rows for each EventCategory type, querying the entire table for a specific EventCategory would still have to search every single partition, which would then almost definitely have some data with that EventCategory meaning almost every cluster would have to be searched too.

0
votes
  • The data are partitioned by day and inside that they are clustered,
  • the clustering works best when you load whole partitions (days) at once or export the partition (day) to Google Storage (which should be free) and import it again to another table, when we tried loading something like 4GB JSONS the difference was something like 60/10.