I have a data steaming from the OLTP database to BQ. So if any row gets updated, BQ contains both old and new records in 2 different rows. So I have a query to de-dup the data, but I want to remove the old values from BQ at least once a day.
Here is my select query which will give me the latest records.
SELECT DISTINCT o._sdc_sequence
FROM `my-production.bqtest.mytbl` o
INNER JOIN (
SELECT id,
MAX(_sdc_sequence) AS seq,
MAX(_sdc_batched_at) AS batch
FROM `my-production.bqtest.mytbl`
GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch
id- integer (primary key)sdc_sequence- unix timestamp (when the data inserted into BQ)_sdc_batched_attimestamp - (its a batch steaming, so timestamp of the batch start time)
sample data of the above columns:
select id,
_sdc_sequence,
_sdc_batched_at
FROM `my-production.bqtest.mytbl`
ID: 2741332
_sdc_sequence: 1565726907840002084
_sdc_batched_at: 2019-08-13 21:01:07.687 UTC
I want to delete the old records, I can do daily table rotation with latest rows, but the ETL tool which Im using will not work if I change something on the table structure.
I tried this below query, but it removes some valid rows as well.
delete from `my-production.bqtest.mytbl` where _sdc_sequence not in(
SELECT DISTINCT o._sdc_sequence
FROM `my-production.bqtest.mytbl` o
INNER JOIN (
SELECT id,
MAX(_sdc_sequence) AS seq,
MAX(_sdc_batched_at) AS batch
FROM `my-production.bqtest.mytbl`
GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch
Because I have 2 rows with same sequence id, so I need to filter it out with where _sdc_sequence not in + max(_sdc_batched_at)
Or any other better query to perform this.