9
votes

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating/updating databases and the corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know there are no schema/partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order for it to pick up new data?

2

2 Answers

6
votes

AWS Glue will automatically detect new data in S3 buckets as long as it is within your existing folders (partitions).

If data is added to new folders (partitions), you need to reload your partitions using `MSCK REPAIR TABLE mytable;`.
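To sketch what that looks like in practice, here is a minimal example of issuing `MSCK REPAIR TABLE` through the Athena API with boto3. The table, database, and S3 output location names are placeholders, not values from the question:

```python
def repair_table_query(table: str) -> str:
    """Build the Athena statement that reloads all partitions for a table."""
    return f"MSCK REPAIR TABLE {table};"


def reload_partitions(table: str, database: str, output_s3: str) -> str:
    """Run MSCK REPAIR TABLE via Athena and return the query execution id.

    `database` and `output_s3` are placeholders for your own Glue
    database name and S3 query-results location.
    """
    import boto3  # deferred import so the helper above works without the AWS SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=repair_table_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]
```

Note that `start_query_execution` is asynchronous; if the ETL job depends on the repaired partitions, poll `get_query_execution` until the query has succeeded before starting it.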

0
votes

It's necessary to run the crawler prior to the job.

The crawler replaces Athena's `MSCK REPAIR TABLE` and also updates the table with new columns as they're added.
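If you do run the crawler before each job, you typically need to wait for it to finish before the ETL job starts. A minimal sketch with boto3 (the crawler name is a placeholder, and the polling interval is an assumption):

```python
import time


def crawler_is_idle(state: str) -> bool:
    """A Glue crawler can be (re)started only when it is in the READY state."""
    return state == "READY"


def run_crawler_and_wait(name: str, poll_seconds: int = 30) -> None:
    """Start a Glue crawler and block until it returns to READY.

    `name` is a placeholder for your own crawler's name.
    """
    import boto3  # deferred import so crawler_is_idle works without the AWS SDK

    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if crawler_is_idle(state):
            break
        time.sleep(poll_seconds)
```

In practice you can avoid hand-rolled polling by scheduling the crawler and the job together with a Glue workflow or trigger, which runs the job automatically when the crawl succeeds.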