2
votes

We are writing the data from MySql to BigQuery. We have set some indicators like

  • Insert - If the records is being added first time, then save it with 'I' in Indicator field
  • Update - If the record has some updated data, then save it with 'U' in the Indicator field and ignore duplicate records if not changed.

But in case of 'Update' it's writing duplicated records as well, which has not even changed. Following is the query we are currently using to insert the data into BigQuery table. What changes can we made to this query?

"insert into `actual_table` 

(
    Id,
   ...
)
select
temp.Id,
...
case when actual.Id is null then 'I'
when actual.Id is not null and actual.field1<>temp.field1 then 'U'
end as Indicator,
FROM `temp_table` temp 
left outer join `actual_table` actual
on temp.Id= actual.Id"

Actual table is the table in BigQuery whereas temp table is the staging table on bigquery. Everytime we read data from MySql, we store it in temp table.

Thanks

2
Hi Aneela, Can you provide example data set you are running on. Another option I like with BigQuery is doing the insert using merge DML command as follow: cloud.google.com/bigquery/docs/reference/standard-sql/…. It's quite a neat solution if this suite your use case.Tamir Klein
Thanks Tamir. It sounds like a great solution.Aneela Saleem Ramzan
Glad you find it useful, I post it as an answer and will be great if you can accept it and vote it up, Good luck with this.Tamir Klein

2 Answers

4
votes

Another option I like with BigQuery is doing the inserts using merge DML, It's quite a neat solution if this suite your use case. You can see more details in this link.

Sql Example:

MERGE
    `mytable` as tgt
USING
    `mytable` as src
ON FALSE
WHEN NOT MATCHED AND src._PARTITIONTIME = '2019-02-21'
THEN INSERT (_PARTITIONTIME, fields...) VALUES (_PARTITIONTIME, fields...)
WHEN NOT MATCHED BY SOURCE AND tgt._PARTITIONTIME = '2019-02-21'
THEN DELETE
0
votes

I suspect that it is not possible your code insert duplicate field (Id and field1 are same) with 'U' indicator as you mentioned in your code, Your code must return an error if there is a duplicate field because there will be no data in "Indicator" field and column count will mismatch, put an else field to case and use another select query to filter fields with "U" or "I" Indicator If Indicator field is not necessary, use merge command...

"insert into `actual_table` 

(
    Id,
   ...
)
select * from
(
select
temp.Id,
...
case when actual.Id is null then 'I'
when actual.Id is not null and actual.field1<>temp.field1 then 'U'
else null 
end as Indicator,
FROM `temp_table` temp 
left outer join `actual_table` actual
on temp.Id= actual.Id
)
where Indicator is not null
"