9
votes

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating/updating databases and the corresponding tables).

However, it's not clear whether we need to run a crawler regularly to detect new data in a source (i.e., new objects in S3, new rows in a DB table) if we know there are no schema/partitioning changes.

So, is it required to run a crawler prior to running an ETL job in order for it to pick up new data?

2

2 Answers

6
votes

AWS Glue will automatically detect new data in S3 buckets as long as it is within your existing folders (partitions).

If data is added to new folders (partitions), you need to reload your partitions using `MSCK REPAIR TABLE mytable;`.
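To sketch what that looks like in practice, here is a minimal example of issuing `MSCK REPAIR TABLE` through the Athena API with boto3. The table, database, and S3 output location names are placeholders, not values from the question:

```python
def repair_table_query(table: str) -> str:
    """Build the Athena statement that reloads all partitions for a table."""
    return f"MSCK REPAIR TABLE {table};"


def reload_partitions(table: str, database: str, output_s3: str) -> str:
    """Run MSCK REPAIR TABLE via Athena and return the query execution id.

    `database` and `output_s3` are placeholders for your own Glue
    database name and S3 query-results location.
    """
    import boto3  # deferred import so the helper above works without the AWS SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=repair_table_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]
```

Note that `start_query_execution` is asynchronous; if the ETL job depends on the repaired partitions, poll `get_query_execution` until the query has succeeded before starting it.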

0
votes

It's necessary to run the crawler prior to the job.

The crawler replaces Athena's `MSCK REPAIR TABLE` and also updates the table with new columns as they're added.
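If you do run the crawler before each job, you typically need to wait for it to finish before the ETL job starts. A minimal sketch with boto3 (the crawler name is a placeholder, and the polling interval is an assumption):

```python
import time


def crawler_is_idle(state: str) -> bool:
    """A Glue crawler can be (re)started only when it is in the READY state."""
    return state == "READY"


def run_crawler_and_wait(name: str, poll_seconds: int = 30) -> None:
    """Start a Glue crawler and block until it returns to READY.

    `name` is a placeholder for your own crawler's name.
    """
    import boto3  # deferred import so crawler_is_idle works without the AWS SDK

    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if crawler_is_idle(state):
            break
        time.sleep(poll_seconds)
```

In practice you can avoid hand-rolled polling by scheduling the crawler and the job together with a Glue workflow or trigger, which runs the job automatically when the crawl succeeds.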