ETL Data from Amazon DMS to S3 to Redshift

Question

I am currently building a data lake within S3 and have successfully moved data from a mysql DB to S3 using DMS. Within DMS I chose the option 'Migrate existing data and replicate ongoing changes'. I transformed the data to csv, but in the future will probably choose parquet. This method creates an initial csv file that has all the raw data from the database tables. It then creates subsequent csv files with inserts, deletes, updates.

Now I am looking for a strategy to copy the data from S3 into Redshift. I am looking for a strategy to copy the bulk data and copy the continual changes from S3 into Redshift. I will likely need to aggregate and summarize much of this data. Preferably I'll use AWS Glue, which uses Python. This ETL process will have to read from csv files in S3 and know to ignore files that have already been processed.

How can I facilitate this? Any ideas are welcome. Ideally I'm hoping for some Python (or Java) scripts that precisely do the process as described.

Thanks.

why did you set up DMS to go from mysql->S3? rather than mysql->redshift? i.e. load all data to redshift "staging" area then run your data transformations directly on redshift? — Jon Scott
As I mentioned I am building a data lake. The raw data needs to stay in S3 which is the dedicated lake. — ChrisK
How did you handle the updates coming from mysql to S3 ? Are you merging it someway or what — Surajit Kundu

Joe Harris Joe Harris · Accepted Answer · 2017-11-13T15:28:51

Recommend setting up AWS Glue to automatically watch the S3 destination and convert the CSV into Parquet. - http://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html - https://github.com/awslabs/aws-glue-samples

Then use Amazon Redshift Spectrum to selectively load and/or query the data directly from the data lake. - http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html

ETL Data from Amazon DMS to S3 to Redshift

2 Answers