
After doing a bit of research, I see that since Redshift doesn't support merge/upsert, some people are using staging tables to update/insert records. Since Redshift also doesn't support procedures (triggers, etc.), does anyone have suggestions for how they've automated this process (Lambda, Data Pipeline, etc.)?

1 Answer


I don't think there is one right answer to this. Broadly, there are two approaches I know of:

  1. An explicit step in your ETL that's performed as you're loading data (using this pattern)
  2. A batch cronjob that handles upserts periodically

Option 1 is probably the better approach. AWS Data Pipeline supports it through the RedshiftCopyActivity by setting the insertMode field to OVERWRITE_EXISTING, which replaces existing rows whose primary key matches the incoming data.

At my job we've done this manually in our data pipeline, by running the COPY into a staging table and the upsert within the same SQL transaction.
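
A minimal sketch of what that transaction can look like. The table names (target_table, staging_table), the id merge key, the S3 path, and the IAM role are all placeholders; adapt them to your schema:

```sql
BEGIN;

-- Stage the new batch in a temp table with the same schema as the target.
CREATE TEMP TABLE staging_table (LIKE target_table);
COPY staging_table
FROM 's3://my-bucket/batch/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;

-- Delete target rows that are superseded by the incoming batch.
DELETE FROM target_table
USING staging_table
WHERE target_table.id = staging_table.id;

-- Insert the whole batch: both the replacements and the genuinely new rows.
INSERT INTO target_table
SELECT * FROM staging_table;

COMMIT;
```

Because the temp table is session-scoped there is nothing to clean up afterwards, and if any step fails the transaction rolls back and the target table is left untouched.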

The other approach (option 2) is to have your ETL just append new data to a staging table (you can use AWS Firehose for something like this), and then have a scheduled cronjob merge the staging table into the target table. The upside of this is that your ETL stays simpler, but the cronjob becomes another point of failure.
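
A rough sketch of the SQL such a cronjob might run (e.g., via psql on a schedule); again, target_table, staging_table, and the id key are placeholders:

```sql
BEGIN;

-- Replace target rows that have a newer version waiting in staging.
DELETE FROM target_table
USING staging_table
WHERE target_table.id = staging_table.id;

INSERT INTO target_table
SELECT * FROM staging_table;

-- Clear the processed rows with DELETE rather than TRUNCATE:
-- TRUNCATE in Redshift implicitly commits, which would break the
-- atomicity of this block.
DELETE FROM staging_table;

COMMIT;
```

One caveat with this pattern: if Firehose can write to the staging table while the merge is running, you'll want to think about what happens to rows that land mid-transaction (for example, merge only rows older than a cutoff timestamp).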