How to model S3 storage for query using AWS RedShift Spectrum

Question

There is a users table present in a MySQL database.

We want to migrate the data into AMazon S3 for further analysis using Amazon Redshift.

Day1 - Export 10 rows of data from users table (Total Rows: 10)
Day2 - Export 2 additional rows of data from users table (Total Rows: 12)
Day3 - Export a modified row and a new row of data from users table (Total Rows: 13)

Proposed S3 data folder/naming conventions:

s3://data/users/YYYYMMDD/users-YYYYMMDDHHMMSS.csv

Amazon Redshift loads data as follows for querying all rows from users:

create EXTERNAL TABLE redshift_users from s3://data/users

Questions:

We are assuming the modified row on Day3 will take precedence from the data available on Day1?
What if 2 existing rows are deleted on Day4, how should this be stored in S3 which can be understood by Redshift?

Jon Scott Jon Scott · Accepted Answer · 2019-06-13T18:06:32

To answer your questions:

You assume that a new row will take precedence automatically over an older row - this is NOT the case. You need to add a "processed_timestamp" column to your data and process that inside your query - your query must work out how to get the latest row only (I use window functions for this).
Usually you cannot physically delete rows sensibly from S3. You need to implement a logical deletion process in conjunction with #1 above.