AWS Glue & Deduplicating Incremental CSV Files

Question

I have csv files being delivered to S3 daily that are incremental for the month. So file1 contains data for day 1, file2 contains data for day 1 and day 2, etc, etc. Each day I'd like to run an ETL on that data and write it out to different S3 location so I can query it with Athena without duplicate rows being present. Essentially I just want to query the most recent state of the aggregate data (which would be just the contents of the most recently delivered file to S3).

I don't think bookmarks will work since the incremental deliveries contain the data in previous files, thus would produce duplicates. I know I can convert to a data frame and remove duplicates like this if I was running on all the files in the source bucket:

spark_df = resolvechoice2.toDF()
spark_df = spark_df.dropDuplicates()

But that seems like it would add a lot of processing for me to run the ETL on all of the data in the source table each time.

Basic workflow would be something like, new file gets delivered, maybe use Lambda to kick of the AWS Glue Job which processes the contents of just that file and then replaces the content of the output bucket. Output bucket is partitioned by year and month.

Would the easiest way to be to just enable bookmarks and delete all of the content in the output bucket on each run?

botchniaque botchniaque · Accepted Answer · 2018-07-25T11:08:59

If your files were in separate folders you could user Athena's EXTERNAL TABLE and everyday point to current partition (eg. using Lambda): - remove partition pointing to yesterday's folder - add partition pointing to today's folder - on month's end you leave the partition pointing to the last day (containing whole month's data).

This way you don't need any de-duplication process, you just manage your Athena partitions.

AWS Glue & Deduplicating Incremental CSV Files

1 Answers