BigQuery deduplication and partitioned table

Question

I want to create a view in BigQuery that would allow for the following:

Deduplication of records
Expose _PARTITIONTIME as a column to use for filtering to allow for partition pruning

I'm aware that the suggested query to do a de-duplication is:

SELECT
  * EXCEPT(row_number)
  FROM (
SELECT *
    ,ROW_NUMBER()
          OVER (PARTITION BY fct.KeyColumn ORDER BY fct.CreatedDatetime DESC) row_number
FROM `my-project.my_dataset.fact_table` fct)
WHERE
  row_number = 1

And the suggested way on how to query a partitioned table is:

SELECT
  *
FROM
  `my-project.my_dataset.fact_table`
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02');

How can I marry the two to provide a BigQuery view that is deduped and provides a column to allow the use of _PARTITIONTIME in a where clause in the view. I read about the limitations with partition pruning if using subselects etc. which is why I'm wondering whether this is possible at all.

To give you a bit of context, this view is on top of a table that is being streamed to so the dedupe has to happen on the fly.

Elliott Brossard Elliott Brossard · Accepted Answer · 2017-07-21T15:47:45

You would need to have the view return the latest row on a per-day basis to enable filtering on the partition time. It would be something like this:

#standardSQL
SELECT day, latest_row.*
FROM (
  SELECT
    _PARTITIONTIME AS day,
    ARRAY_AGG(fct ORDER BY fct.CreatedDatetime DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM `my-project.my_dataset.fact_table` AS fct
  GROUP BY day, KeyColumn
);

In the outer query, latest_row.* should also include KeyColumn.

BigQuery deduplication and partitioned table

1 Answers