BigQuery - Delete with two column in where and in clause

Question

I have a data steaming from the OLTP database to BQ. So if any row gets updated, BQ contains both old and new records in 2 different rows. So I have a query to de-dup the data, but I want to remove the old values from BQ at least once a day.

Here is my select query which will give me the latest records.

SELECT DISTINCT o._sdc_sequence
      FROM `my-production.bqtest.mytbl` o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM `my-production.bqtest.mytbl`
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

id - integer (primary key)
sdc_sequence- unix timestamp (when the data inserted into BQ)
_sdc_batched_at timestamp - (its a batch steaming, so timestamp of the batch start time)

sample data of the above columns:

select id, 
_sdc_sequence, 
_sdc_batched_at 
FROM `my-production.bqtest.mytbl`

ID: 2741332
_sdc_sequence: 1565726907840002084
_sdc_batched_at: 2019-08-13 21:01:07.687 UTC

I want to delete the old records, I can do daily table rotation with latest rows, but the ETL tool which Im using will not work if I change something on the table structure.

I tried this below query, but it removes some valid rows as well.

delete from  `my-production.bqtest.mytbl` where _sdc_sequence not in( 
SELECT DISTINCT o._sdc_sequence
      FROM `my-production.bqtest.mytbl` o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM `my-production.bqtest.mytbl` 
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

Because I have 2 rows with same sequence id, so I need to filter it out with where _sdc_sequence not in + max(_sdc_batched_at)

Or any other better query to perform this.

Platform's such as Google Bigquery are not great performant solutions when it comes to hard deletes. Have you considered a soft delete solution i.e. sump all versions of incremental feed in big query and create specific views which use partition functions such as Row Number to suppress all rows which are not latest version? — Raunak Jhawar
Yes, even Im not happy to do this, but, its the requirement and I need to implement. That's why Im doing this daily once. Then it'll reduce my query execution time. — TheDataGuy
The requirement to mutate the target BQ table with deletes would also mean that you lose all history of not recent records every time you run hard deletes vis-a-vis, with a materialized view, you stand to gain full data lineage of all stream inserts completed on your target BQ table. — Raunak Jhawar

Mohammed Sherif KK Mohammed Sherif KK · Accepted Answer · 2019-09-04T06:11:32

select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
          (select (t2.id, MAX(t2._sdc_sequence))
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

Use a () in the column list of the select query to avoid Subquery of type IN must have only one output column error

BigQuery - Delete with two column in where and in clause

4 Answers