I am currently building a data lake in S3 and have successfully moved data from a MySQL database into S3 using AWS DMS. Within DMS I chose the option 'Migrate existing data and replicate ongoing changes'. I output the data as CSV, but in the future I will probably switch to Parquet. This setup creates an initial CSV file per table containing all of the raw data, and then writes subsequent CSV files containing the ongoing inserts, updates, and deletes.
Now I am looking for a strategy to copy this data from S3 into Redshift: both the initial bulk load and the continual changes. I will likely need to aggregate and summarize much of this data as well. Preferably I'd use AWS Glue, which supports Python (PySpark). The ETL process will have to read the CSV files from S3 and know to skip files that have already been processed.
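For context, here is a rough sketch of the kind of Glue job I have in mind. It assumes Glue job bookmarks are enabled (via `transformation_ctx`) to skip already-processed files, and a Glue catalog connection to Redshift; the bucket path, connection name, and table names below are just placeholders, not anything I have set up yet:

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# JOB_NAME is passed automatically by Glue; TempDir is the S3 staging
# path Glue uses for the Redshift COPY behind the scenes.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the DMS output (initial load + CDC files) from S3.
# With job bookmarks enabled, the transformation_ctx lets Glue remember
# which files it has already read, so reruns only pick up new CDC files.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-data-lake/dms-output/mydb/mytable/"],  # placeholder path
        "recurse": True,
    },
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="source",
)

# Load the new rows into a Redshift staging table through a Glue catalog
# connection named "redshift-conn" (placeholder). Merging the CDC rows
# (inserts/updates/deletes, driven by the "Op" column DMS adds to its
# CDC files) into the final table would be a separate SQL step in Redshift.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "staging.mytable", "database": "analytics"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="sink",
)

# Committing the job is what persists the bookmark state.
job.commit()
```

I realize the actual merge/upsert logic (applying the updates and deletes based on the Op flag) would still have to happen in Redshift itself, and that's part of what I'm unsure how best to structure.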
How can I facilitate this? Any ideas are welcome. Ideally I'm hoping for some Python (or Java) examples that implement the process described above.
Thanks.