I have an RDS database whose snapshot is taken every day and kept in an S3 bucket. I copy the RDS snapshot data from S3 into an Amazon Redshift database daily. I can use the COPY command to load the tables, but instead of copying each whole table, I want to copy only the rows that were added since the last snapshot was taken (incremental copying).
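
For reference, my current full-table load looks roughly like this (the bucket path and IAM role are placeholders; "user" is quoted because it is a reserved word in Redshift):

COPY "user"
FROM 's3://my-bucket/snapshots/user/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;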

For example, in RDS there is a table named "user" which looks like this on 25-05-2021:

id | username
1  | john
2  | cathy

When I run the data loader for the first time on 26-05-2021, it will copy these two rows into the Redshift table of the same name.

Now on 26-05-2021, the table in RDS looks like this:

id | username
1  | john
2  | cathy
3  | ola
4  | mike

When I run the data loader on 27-05-2021, instead of copying all four rows, I want to copy only the rows that have been newly added (id = 3 and id = 4), since I already have the others.
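
In SQL terms, each daily run should have roughly this effect (pseudo-SQL; snapshot_data stands for wherever that day's snapshot rows can be read from):

INSERT INTO "user"
SELECT *
FROM snapshot_data
WHERE id > (SELECT MAX(id) FROM "user");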

What would be the best way to do this incremental loading?

Just a quick question before I think about an answer: what are you doing about VACUUM? - Max Ganz II
@MaxGanzII, do you mean what I should do about vacuuming in RDS? If so, that doesn't change anything in Redshift, as I have a different TTL strategy for the warehouse. - Rafiul Sabbir
No, I mean VACUUM in Redshift. You are loading records into a table. Unless you've taken special steps, they are loaded unsorted. You need to sort them (or there's no point in using Redshift). Sorting is intimately bound up with loading, in part because only one VACUUM can run at a time on a cluster. To answer a question about loading, you also have to discuss VACUUM. - Max Ganz II
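
(For reference, a typical post-load VACUUM pass on the example table might look like this sketch; SORT ONLY is one option among several:)

-- Re-sort the newly appended rows; skips space reclamation,
-- which a pure append workload rarely needs
VACUUM SORT ONLY "user";
-- Refresh planner statistics after the data changes
ANALYZE "user";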

1 Answer

The COPY command will always load the entire table. However, you could create an external table using Redshift Spectrum that accesses the files in S3 without loading them into Redshift. Then you could run an INSERT that selects only the rows whose ID is greater than the last ID already present in the Redshift table.

Perhaps I should explain it a bit more simply...

  • Table existing_table in Redshift already has rows up to id = 2
  • CREATE EXTERNAL TABLE in_data to point at the files in S3 containing the data
  • Then use INSERT INTO existing_table SELECT * FROM in_data WHERE id > (SELECT MAX(id) FROM existing_table)

In theory, this should only load the new rows into the table.
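
Put together, the approach might look like this sketch. The external schema name, file format, and S3 path are assumptions; adjust them to match how your snapshot files are actually laid out:

-- One-time setup: an external schema backed by an AWS Glue data catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'snapshot_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-spectrum-role';

-- Point an external table at the snapshot files; no data is loaded into Redshift yet
CREATE EXTERNAL TABLE spectrum.in_data (
    id       BIGINT,
    username VARCHAR(100)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/snapshots/user/';

-- Daily load: pull in only the rows Redshift does not already have
INSERT INTO existing_table
SELECT id, username
FROM spectrum.in_data
WHERE id > (SELECT MAX(id) FROM existing_table);

Note that this assumes id is strictly increasing and existing rows are never updated; if rows can change, you would need an upsert via a staging table instead.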