0 votes

I have a Data Pipeline that exports data from a local DB to Redshift via S3 (very similar to the "Incremental copy of RDS MySQL table to Redshift" template). I have defined a primary key and set insertMode to "OVERWRITE_EXISTING" in the pipeline definition; however, I noticed that some rows eventually ended up duplicated. In what cases does this happen, and how do I prevent it?


2 Answers

1 vote

Redshift does not enforce primary keys, so it will not reject duplicate values on its own. What we do is load the incremental data into a temporary staging table, then upsert into the target table with a delete-and-insert merge that checks whether each record already exists, as in the sketch below.
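A minimal sketch of that staging-table merge, assuming a target table named `target` keyed on a column `id`; the table names, key column, S3 prefix, and IAM role are all placeholders:

```sql
BEGIN;

-- Stage the incremental export; the temp table inherits target's columns
CREATE TEMP TABLE stage (LIKE target);

COPY stage
FROM 's3://my-bucket/incremental/'                         -- placeholder export prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'   -- placeholder role
FORMAT AS CSV;

-- Delete target rows that are about to be replaced
-- (the merge pattern described in Redshift's deep-copy/upsert docs)
DELETE FROM target
USING stage
WHERE target.id = stage.id;

-- Insert the fresh rows
INSERT INTO target
SELECT * FROM stage;

END;
```

Running the delete and insert inside a single transaction keeps readers from ever seeing the table half-merged.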

Thanks!!

0 votes

Just found this post after several years; adding an answer in case it helps someone else:

In addition to the primary key, the OVERWRITE_EXISTING mode also matches rows on the distribution key (distkey). So in my case, an updated value in the distkey column caused Redshift to create a duplicate row even though the primary key was unchanged.
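To illustrate with a hypothetical schema (the table and column names are made up): if the distkey sits on a column whose value can change, the overwrite lookup misses the old row and a duplicate gets inserted.

```sql
-- Hypothetical table: the distribution key is on a mutable column
CREATE TABLE events (
    event_id BIGINT,        -- declared primary key below; Redshift does not enforce it
    region   VARCHAR(16),   -- mutable business attribute
    payload  VARCHAR(256),
    PRIMARY KEY (event_id)
)
DISTKEY (region);

-- If an incoming row has event_id = 42 but its region changed from 'us-east' to
-- 'eu-west', the overwrite matches on (primary key + distkey), finds no existing
-- row, and inserts a new one: event_id 42 now exists twice. Putting the distkey
-- on an immutable column (often the primary key itself) avoids this:
-- CREATE TABLE events (...) DISTKEY (event_id);
```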