I want to create a view in BigQuery that would allow for the following:
- Deduplication of records
- Expose _PARTITIONTIME as a column to use for filtering to allow for partition pruning
I'm aware that the suggested query to do a de-duplication is:
SELECT
* EXCEPT(row_number)
FROM (
SELECT *
,ROW_NUMBER()
OVER (PARTITION BY fct.KeyColumn ORDER BY fct.CreatedDatetime DESC) row_number
FROM `my-project.my_dataset.fact_table` fct)
WHERE
row_number = 1
And the suggested way on how to query a partitioned table is:
SELECT
*
FROM
`my-project.my_dataset.fact_table`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
AND TIMESTAMP('2016-01-02');
How can I marry the two to provide a BigQuery view that is deduped and provides a column to allow the use of _PARTITIONTIME in a where clause in the view. I read about the limitations with partition pruning if using subselects etc. which is why I'm wondering whether this is possible at all.
To give you a bit of context, this view is on top of a table that is being streamed to so the dedupe has to happen on the fly.