1
votes

I have a data steaming from the OLTP database to BQ. So if any row gets updated, BQ contains both old and new records in 2 different rows. So I have a query to de-dup the data, but I want to remove the old values from BQ at least once a day.

Here is my select query which will give me the latest records.

SELECT DISTINCT o._sdc_sequence
      FROM `my-production.bqtest.mytbl` o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM `my-production.bqtest.mytbl`
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch
  • id - integer (primary key)
  • sdc_sequence- unix timestamp (when the data inserted into BQ)
  • _sdc_batched_at timestamp - (its a batch steaming, so timestamp of the batch start time)

sample data of the above columns:

select id, 
_sdc_sequence, 
_sdc_batched_at 
FROM `my-production.bqtest.mytbl`

ID: 2741332
_sdc_sequence: 1565726907840002084
_sdc_batched_at: 2019-08-13 21:01:07.687 UTC

I want to delete the old records, I can do daily table rotation with latest rows, but the ETL tool which Im using will not work if I change something on the table structure.

I tried this below query, but it removes some valid rows as well.

delete from  `my-production.bqtest.mytbl` where _sdc_sequence not in( 
SELECT DISTINCT o._sdc_sequence
      FROM `my-production.bqtest.mytbl` o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM `my-production.bqtest.mytbl` 
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

Because I have 2 rows with same sequence id, so I need to filter it out with where _sdc_sequence not in + max(_sdc_batched_at)

Or any other better query to perform this.

4
Platform's such as Google Bigquery are not great performant solutions when it comes to hard deletes. Have you considered a soft delete solution i.e. sump all versions of incremental feed in big query and create specific views which use partition functions such as Row Number to suppress all rows which are not latest version? - Raunak Jhawar
Yes, even Im not happy to do this, but, its the requirement and I need to implement. That's why Im doing this daily once. Then it'll reduce my query execution time. - TheDataGuy
The requirement to mutate the target BQ table with deletes would also mean that you lose all history of not recent records every time you run hard deletes vis-a-vis, with a materialized view, you stand to gain full data lineage of all stream inserts completed on your target BQ table. - Raunak Jhawar

4 Answers

4
votes
select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
          (select (t2.id, MAX(t2._sdc_sequence))
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

Use a () in the column list of the select query to avoid Subquery of type IN must have only one output column error

0
votes

If you want to keep only one row for each id, then tuple syntax might be simplest. For the records to keep:

select t.*
from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) in
          (select t2.id, MAX(t2._sdc_sequence)
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

Based on your description, I'm not sure what the batch has to do with the problem, so I left it out.

You can turn this into a delete using either not in or similar logic:

delete from `my-production.bqtest.mytbl` t
where (id, _sdc_sequence) not in
          (select t2.id, MAX(t2._sdc_sequence)
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          );

You can also phrase this as:

delete from `my-production.bqtest.mytbl` t
where _sdc_sequence <
          (selectd max(t2._sdc_sequence)
           from `my-production.bqtest.mytbl` t2
           where t2.id = t.id
          );
0
votes

@Mr.Llama

you can avoid it by using a seperator

e.g. CONCAT(ABCD, EF) ---> CONCAT(ABCD,'-', EF)

-2
votes

Just concat the two fields, both in the where check and in the subquery, to turn them into a single one with unique values (make sure they're unique before running the delete):

select t.*
from `my-production.bqtest.mytbl` t
where concat(id, _sdc_sequence) in
          (select concat(t2.id, MAX(t2._sdc_sequence))
           from `my-production.bqtest.mytbl` t2
           group by t2.id
          )