I have CSV files stored in AWS S3 on a daily basis. Below is my S3 file path structure:

s3://data-dl/abc/d=2019-09-19/2019-09-19-data.csv

In this structure, the date part of the S3 file path is generated anew every day.

Now I want to use AWS Glue for ETL to ship data from S3 to Redshift. How can I add this S3 path to the Glue Data Catalog? I want to sync only the most recent folder's CSV files.

Also, for the job part, how can I declare this dynamic path in the Glue PySpark script?

2 Answers

Populating Glue Catalog

You can create an external table in Athena partitioned by your date column. Then execute the MSCK REPAIR TABLE command to update the partition information in the table whenever new data is added to S3.

This will keep your Glue Data Catalog up to date with all the latest data.
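For illustration, here is a minimal sketch of that approach driven from Python with boto3; the database and table names, the placeholder columns, and the Athena results location are assumptions, and only the S3 location and the d partition key come from the question.

# Sketch only: register the daily folders as a partitioned external table in the
# Glue Data Catalog via Athena, then refresh the partitions with MSCK REPAIR TABLE.
# Assumptions: database "my_db", table "abc_data", placeholder columns, and a
# placeholder Athena results location.
import time

import boto3

athena = boto3.client("athena")

CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.abc_data (
    id string,
    value string
)
PARTITIONED BY (d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://data-dl/abc/'
"""

REPAIR_TABLE = "MSCK REPAIR TABLE my_db.abc_data"


def run_query(sql):
    """Start an Athena query and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://data-dl/athena-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)


# Create the table once, then re-run the repair whenever a new daily folder lands.
run_query(CREATE_TABLE)
run_query(REPAIR_TABLE)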

Reference AWS documentation:

Create External Table

MSCK repair table to update partitions

Reading one day's data in Glue ETL

You can create a dynamic frame from the catalog in Glue using the table created in the step above. You can also pass the "push_down_predicate" parameter to read only one day's records while creating the dynamic frame.
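Here is a rough sketch of such a Glue job; the catalog database and table names, the Redshift connection name, and the temporary directory are placeholders.

# Sketch of a Glue PySpark job that reads only today's partition and loads it
# into Redshift. Catalog database/table names, the Glue connection name, and
# the temporary directory are placeholders.
import sys
from datetime import date

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# push_down_predicate prunes partitions at read time, so only the d=<today> folder is scanned.
today = date.today().isoformat()
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="abc_data",
    push_down_predicate="d = '{}'".format(today),
)

# Write the day's records to Redshift through a pre-configured Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "abc_data", "database": "mydb"},
    redshift_tmp_dir="s3://data-dl/glue-temp/",
)

job.commit()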

Reference AWS documentation:

Create dynamic frame from catalog

If you just want to sync the data, you don't need ETL. You can use the Redshift COPY command to load it directly from S3. You can run a Python shell job at a scheduled interval, or set up a Lambda (or SNS) with an S3 event so the load is triggered as soon as the files land in S3.
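As an illustration of the Lambda route, here is a rough sketch that assumes psycopg2 is packaged with the function (e.g. as a layer) and uses placeholder connection details, target table, and IAM role.

# Sketch of a Lambda handler fired by an S3 ObjectCreated event that issues a
# Redshift COPY for the newly arrived file. The cluster endpoint, credentials,
# target table, and IAM role are placeholders.
from urllib.parse import unquote_plus

import psycopg2

REDSHIFT_DSN = (
    "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com "
    "port=5439 dbname=mydb user=loader password=secret"
)
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"


def handler(event, context):
    conn = psycopg2.connect(REDSHIFT_DSN)
    try:
        with conn.cursor() as cur:
            for record in event["Records"]:
                bucket = record["s3"]["bucket"]["name"]
                # Keys arrive URL-encoded in S3 events, e.g. "abc/d%3D2019-09-19/..."
                key = unquote_plus(record["s3"]["object"]["key"])
                copy_sql = (
                    "COPY abc_data "
                    "FROM 's3://{}/{}' "
                    "IAM_ROLE '{}' "
                    "FORMAT AS CSV;"
                ).format(bucket, key, IAM_ROLE)
                cur.execute(copy_sql)
        conn.commit()
    finally:
        conn.close()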