There is a users table in a MySQL database.
We want to migrate the data into Amazon S3 for further analysis using Amazon Redshift.
- Day 1 - Export 10 rows of data from the users table (Total Rows: 10)
- Day 2 - Export 2 additional rows of data from the users table (Total Rows: 12)
- Day 3 - Export 1 modified row and 1 new row of data from the users table (Total Rows: 13)
Proposed S3 folder and file naming convention:
s3://data/users/YYYYMMDD/users-YYYYMMDDHHMMSS.csv
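Under this convention, the three daily exports would land in S3 roughly as follows (assuming, purely for illustration, that Day 1 is 2024-01-01 and each export runs at 09:00:00):

- s3://data/users/20240101/users-20240101090000.csv (Day 1: 10 rows)
- s3://data/users/20240102/users-20240102090000.csv (Day 2: 2 new rows)
- s3://data/users/20240103/users-20240103090000.csv (Day 3: 1 modified row + 1 new row)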
To query all rows from users, Amazon Redshift would then load the data along these lines:
CREATE EXTERNAL TABLE redshift_users (...) LOCATION 's3://data/users/'
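Written out in Redshift Spectrum syntax, this would presumably look something like the sketch below. The external schema name, data catalog database, IAM role ARN, and column list are all assumptions for illustration, since the actual users schema is not shown here:

```sql
-- Sketch only: schema name, database, IAM role, and columns are placeholders.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over the CSV exports; columns are assumed, not taken from the real users table.
CREATE EXTERNAL TABLE spectrum.redshift_users (
  id         BIGINT,
  name       VARCHAR(255),
  email      VARCHAR(255),
  updated_at TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://data/users/';
```

Note that an external table defined this way scans every CSV file under s3://data/users/, including all of the dated subfolders, which is what motivates the two questions below.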
Questions:
- Are we right to assume that the row modified on Day 3 will take precedence over the version of that row exported on Day 1?
- If 2 existing rows are deleted on Day 4, how should that deletion be represented in S3 so that Redshift can understand it?