I have BQ table that is partitioned by insert time. I'm trying to remove duplicates from the table. These are true duplicates: for 2 duplicate rows, all columns are equal - of course having a unique key might have been helpful :-(
At first I tried a SELECT query to enumerate duplicates and remove them:
SELECT
* EXCEPT(row_number)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id_column) row_number
FROM
`mytable`)
WHERE
row_number = 1
This results in unique rows but creates a new table that doesn't include the partition data - so not good.
I've seen this answer here which states the only way to retain partitions is to go over them one-by-one with the above query and save to a specific target table partition.
What I'd really want to do is use a DML DELETE
to remove the duplicate rows in place. I tried something similar to what this answer suggested:
DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
FROM `mytable ` AS d2
WHERE d.id = d2.id) > 1;
But the accepted answer doesn't work and results in a BQ error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
Would be great if anyone could offer a simpler (DML or otherwise) way to deal with this so I won't be required to loop over all partitions individually.