0 votes

I have a few AWS Glue crawlers set up to crawl CSVs in S3 and populate my tables in Athena. My scenario and question: I replace the .csv files in S3 daily with updated versions. Do I have to re-run the existing crawlers, perhaps on a schedule, to update the Athena tables with the latest content? Or is a crawler run only required when the schema changes, such as when columns are added? I just want to ensure that my tables in Athena always return all of the data in the updated CSVs; I rarely make any schema changes to the table structures. If the crawlers only need to run when the structure actually changes, I would prefer to run them much less frequently.
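
For reference, the daily replacement is just an overwrite of the same object key, roughly like this sketch (the bucket, key, and file names are placeholders, not my real setup):

    import boto3

    s3 = boto3.client("s3")

    # Overwrite yesterday's object with today's refreshed CSV.
    s3.upload_file(
        Filename="daily_export.csv",   # placeholder local file
        Bucket="my-data-bucket",       # placeholder bucket
        Key="sales/data.csv",          # placeholder key, same every day
    )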

1 – Are the S3 object key partitions changing when you replace the file? And is your crawler configured to retrieve common object keys? – pkarfs

1 Answer

1 vote

When a Glue crawler runs, the following actions take place (a small example of triggering a run follows the list):

  • It classifies the data to determine the format, schema, and associated properties of the raw data
  • It groups the data into tables or partitions
  • It writes metadata to the Data Catalog
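
If you ever need to trigger a crawl on demand, for example right after a known schema change, a minimal boto3 sketch looks like this (the crawler name is a placeholder, not something from your setup):

    import time
    import boto3

    glue = boto3.client("glue")
    CRAWLER = "my-csv-crawler"  # placeholder crawler name

    # Trigger a one-off run, e.g. after the CSV schema has changed.
    glue.start_crawler(Name=CRAWLER)

    # Wait until the crawler finishes and the Data Catalog is updated.
    while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
        time.sleep(30)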

Athena references the schema of the tables in the Data Catalog, but it reads the data itself straight from S3 at query time. So if you only replace the contents of the CSVs and the schema stays the same, queries will pick up the new data without another crawl, and the crawler schedule can be reduced.
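
If you keep the crawlers but want them to run less often, you can simply change their schedule; here is a sketch using boto3, where the crawler name and cron expression are examples only:

    import boto3

    glue = boto3.client("glue")

    # Move a daily crawler to a weekly run, e.g. Sundays at 02:00 UTC.
    # Glue schedules use the cron(...) syntax described in the AWS docs.
    glue.update_crawler(
        Name="my-csv-crawler",           # placeholder crawler name
        Schedule="cron(0 2 ? * SUN *)",  # example weekly cron expression
    )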

You can also refer to the documentation here to understand how Glue crawlers and CSV files work with Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html