Ignore duplicate records while appending in BigQuery

Question

We are writing the data from MySql to BigQuery. We have set some indicators like

Insert - If the records is being added first time, then save it with 'I' in Indicator field
Update - If the record has some updated data, then save it with 'U' in the Indicator field and ignore duplicate records if not changed.

But in case of 'Update' it's writing duplicated records as well, which has not even changed. Following is the query we are currently using to insert the data into BigQuery table. What changes can we made to this query?

"insert into `actual_table` 

(
    Id,
   ...
)
select
temp.Id,
...
case when actual.Id is null then 'I'
when actual.Id is not null and actual.field1<>temp.field1 then 'U'
end as Indicator,
FROM `temp_table` temp 
left outer join `actual_table` actual
on temp.Id= actual.Id"

Actual table is the table in BigQuery whereas temp table is the staging table on bigquery. Everytime we read data from MySql, we store it in temp table.

Thanks

Hi Aneela, Can you provide example data set you are running on. Another option I like with BigQuery is doing the insert using merge DML command as follow: cloud.google.com/bigquery/docs/reference/standard-sql/…. It's quite a neat solution if this suite your use case. — Tamir Klein
Glad you find it useful, I post it as an answer and will be great if you can accept it and vote it up, Good luck with this. — Tamir Klein

Tamir Klein Tamir Klein · Accepted Answer · 2019-03-19T10:04:54

Another option I like with BigQuery is doing the inserts using merge DML, It's quite a neat solution if this suite your use case. You can see more details in this link.

Sql Example:

MERGE
    `mytable` as tgt
USING
    `mytable` as src
ON FALSE
WHEN NOT MATCHED AND src._PARTITIONTIME = '2019-02-21'
THEN INSERT (_PARTITIONTIME, fields...) VALUES (_PARTITIONTIME, fields...)
WHEN NOT MATCHED BY SOURCE AND tgt._PARTITIONTIME = '2019-02-21'
THEN DELETE

Ignore duplicate records while appending in BigQuery

2 Answers