BigQuery: querying the latest partition, bytes to be processed vs. actually processed

Question

I'm struggling with querying efficiently the last partition of a table, using a date or datetime field. The first approach was to filter like this:

SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)

But that, according to BigQuery's processing estimation, scans the entire table and does not use the partitions. Even Google states this happens in their documentation. It does work if I use an exact value for the partition:

SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE

But if the table is not up to date then the query will not get any results and my automatic procesess will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.

What is the best practice to get the latest partition efficiently?

What makes this worse is that BigQuery's estimation of the bytes to be processed with the active query does not match what was actually processed, unless I'm not interpreting those numbers correctly. Find below the screenshot of the mismatching values.

BigQuery screen with aparrently mistmatching processed bytes

Finally a couple of scenarios that I also tested:

If I store a max_date with a DECLARE statement first, as suggested in this post, the estimation seems to work, but it is not clear why. However, the actual processed bytes after running the query is not different than the case that filters the latest partition in the WHERE clause.
Using the same declared max_date in a table that is both partitioned and clustered, the estimation works only when using a filter for the partition, but fails if I include a filter for the cluster.

How can I tell from the screenshot that the bytes are mismatched? — Felipe Hoffa
Have you read about partition pruning mechanism in Bigquery? — Nick_Kh
The screenshot shows the two mismatching values highlighted in yellow. I was expecting those values to be the same, maybe I'm wrong. The line in green is BigQuery's estimation for what you put in the editor and then in grey the actual processed bytes after executing. — PBurns
I have read about pruning but it only suggests that I shouldn't use subqueries to filter the partition field. Also, I'm using date or datetime fields and no pseudo columns are created for this type of partitioning. — PBurns
As querying the latest partition might be the most common task, I guess it worth to contribute and indicate your interest extending this functionality to Bigquery developers. Have you already visited this public tracker's thread? — Nick_Kh

PBurns PBurns · Accepted Answer · 2020-10-23T15:35:27

After some iterations I got an answer from Google and although it doesn't resolve the issue, it acknowledges it happens.

Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.

They made notes to try and improve this in the future but we have to deal with this for now. I hope this helps somebody that was trying to do the same thing.

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

1 Answers