3
votes

I've created an AWS Glue table based on the contents of an S3 bucket. This allows me to query the data in that S3 bucket using AWS Athena. I defined an AWS Glue crawler and ran it once to auto-determine the schema of the data. This all works nicely.

Afterwards, all data newly uploaded to the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).

Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something like that?


1 Answer

3
votes

A crawler is needed to register new data partitions in the Data Catalog. For example, suppose your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make the new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative is to run 'MSCK REPAIR TABLE {table-name}' in Athena, as sketched below.
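
For illustration, here is a minimal sketch of the Athena statements involved, assuming a hypothetical table named my_table over s3://my-bucket/data/ (both names are illustrative, not taken from the question):

    -- Register one new partition explicitly; the partition values mirror the folder layout above
    ALTER TABLE my_table ADD IF NOT EXISTS
      PARTITION (year = '2018', month = '9', day = '12')
      LOCATION 's3://my-bucket/data/year=2018/month=9/day=12/';

    -- Or let Athena scan the table location and register every Hive-style partition it finds
    MSCK REPAIR TABLE my_table;

Note that MSCK REPAIR TABLE only works when the folder names follow the key=value (Hive) convention shown above; otherwise partitions have to be added explicitly with ALTER TABLE.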

Besides that, a crawler can detect a change in the schema and take the appropriate action depending on your configuration.