I have CSV files stored in AWS S3 on a daily basis. Below is my S3 file path structure:

s3://data-dl/abc/d=2019-09-19/2019-09-19-data.csv

In this structure, the date part of the S3 file path is generated anew every day.

Now I want to use AWS Glue for ETL to ship data from S3 to Redshift. How can I add this S3 path to the Glue Data Catalog? I want to sync only the most recent folder's CSV files.

Also, for the job part, how can I declare this dynamic path in the Glue PySpark script?

2 Answers

Populating Glue Catalog

You can create an external table in Athena partitioned by your date column. Then execute the MSCK REPAIR TABLE command to update the partition information in the table whenever new data is added to S3.

This will keep your Glue Data Catalog up to date with all the latest data.
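For illustration, here is a minimal sketch of that approach driven from Python with boto3; the database and table names, the placeholder columns, and the Athena results location are assumptions, and only the S3 location and the d partition key come from the question.

# Sketch only: register the daily folders as a partitioned external table in the
# Glue Data Catalog via Athena, then refresh the partitions with MSCK REPAIR TABLE.
# Assumptions: database "my_db", table "abc_data", placeholder columns, and a
# placeholder Athena results location.
import time

import boto3

athena = boto3.client("athena")

CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.abc_data (
    id string,
    value string
)
PARTITIONED BY (d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://data-dl/abc/'
"""

REPAIR_TABLE = "MSCK REPAIR TABLE my_db.abc_data"


def run_query(sql):
    """Start an Athena query and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://data-dl/athena-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)


# Create the table once, then re-run the repair whenever a new daily folder lands.
run_query(CREATE_TABLE)
run_query(REPAIR_TABLE)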

Reference AWS documentation:

Create External Table

MSCK repair table to update partitions

Reading one day's data in Glue ETL

You can create a dynamic frame from the catalog in Glue using the table created in the step above. You can also pass the "push_down_predicate" parameter to read only one day's records while creating the dynamic frame.
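Here is a rough sketch of such a Glue job; the catalog database and table names, the Redshift connection name, and the temporary directory are placeholders.

# Sketch of a Glue PySpark job that reads only today's partition and loads it
# into Redshift. Catalog database/table names, the Glue connection name, and
# the temporary directory are placeholders.
import sys
from datetime import date

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# push_down_predicate prunes partitions at read time, so only the d=<today> folder is scanned.
today = date.today().isoformat()
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="abc_data",
    push_down_predicate="d = '{}'".format(today),
)

# Write the day's records to Redshift through a pre-configured Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "abc_data", "database": "mydb"},
    redshift_tmp_dir="s3://data-dl/glue-temp/",
)

job.commit()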

Reference AWS documentation:

Create dynamic frame from catalog

If you just want to sync the data, you don't need ETL. You can use the Redshift COPY command to load it directly from S3. You can run a Python shell job at a scheduled interval, or set up a Lambda (or SNS) with an S3 event so the load is triggered as soon as the files land in S3.
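As an illustration of the Lambda route, here is a rough sketch that assumes psycopg2 is packaged with the function (e.g. as a layer) and uses placeholder connection details, target table, and IAM role.

# Sketch of a Lambda handler fired by an S3 ObjectCreated event that issues a
# Redshift COPY for the newly arrived file. The cluster endpoint, credentials,
# target table, and IAM role are placeholders.
from urllib.parse import unquote_plus

import psycopg2

REDSHIFT_DSN = (
    "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com "
    "port=5439 dbname=mydb user=loader password=secret"
)
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"


def handler(event, context):
    conn = psycopg2.connect(REDSHIFT_DSN)
    try:
        with conn.cursor() as cur:
            for record in event["Records"]:
                bucket = record["s3"]["bucket"]["name"]
                # Keys arrive URL-encoded in S3 events, e.g. "abc/d%3D2019-09-19/..."
                key = unquote_plus(record["s3"]["object"]["key"])
                copy_sql = (
                    "COPY abc_data "
                    "FROM 's3://{}/{}' "
                    "IAM_ROLE '{}' "
                    "FORMAT AS CSV;"
                ).format(bucket, key, IAM_ROLE)
                cur.execute(copy_sql)
        conn.commit()
    finally:
        conn.close()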