AWS Glue: Do I really need a Crawler for new content?

Question

What I understand from the AWS Glue docs is that a crawler helps crawl and discover new data. However, I noticed that after I crawled once, new data that goes into S3 is already discovered when I query the Data Catalog from Athena, for example. So, can I say I do not need to run a crawler every time new data is added, unless there are new schemas?

In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?

Yuriy Bondaruk · Accepted Answer · 2018-11-03T12:33:46

If the data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog before you can query them via Athena.

However, if the data is not partitioned, or it lands in partitions that are already registered, then there is no need to run a crawler.

As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or you can register them manually.
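To make the two options concrete, here is a minimal sketch that only builds the Athena statements as strings. The table name `events`, the bucket `s3://example-bucket/data`, and the helper names are made up for illustration, and it assumes the partition columns are strings laid out Hive-style (key=value folders) as in the path above. You would still paste the result into the Athena console or submit it with whatever client you use.

```python
def repair_table_sql(table):
    """Statement that asks Athena to scan the table location for new partitions."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition_sql(table, base_location, partition):
    """Statement that registers exactly one Hive-style partition by hand.

    `partition` is an ordered mapping of partition key -> value,
    e.g. {"year": 2018, "month": 11, "day": 3}.
    """
    # Partition spec, e.g. year='2018', month='11', day='3'
    # (values are quoted because string partition columns are assumed).
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    # Matching S3 folder, e.g. year=2018/month=11/day=3
    path = "/".join(f"{key}={value}" for key, value in partition.items())
    location = f"{base_location.rstrip('/')}/{path}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and bucket, used only for this example.
statement = add_partition_sql(
    "events",
    "s3://example-bucket/data",
    {"year": 2018, "month": 11, "day": 3},
)
print(repair_table_sql("events"))
print(statement)
```

MSCK REPAIR TABLE is less typing but scans the whole table location, while ADD PARTITION touches only the partition you name, so the second form tends to suit a job that already knows which folder it just wrote.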

The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE query in Athena or fill in all the fields in the AWS Glue console, you can go that way as well.
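For the manual route, a rough sketch of what such a CREATE TABLE query could look like, again built as a plain string. The column names, the Parquet file format, and the bucket path are all assumptions for the sake of the example, so swap in your own schema and format.

```python
def create_table_sql(table, columns, partition_keys, location, file_format="PARQUET"):
    """Build an Athena CREATE EXTERNAL TABLE statement.

    `columns` and `partition_keys` are lists of (name, type) pairs.
    """
    cols = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_keys)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and location, used only for this example.
ddl = create_table_sql(
    "events",
    [("id", "string"), ("amount", "double")],
    [("year", "string"), ("month", "string"), ("day", "string")],
    "s3://example-bucket/data/",
)
print(ddl)
```

Note that a partitioned table created this way starts with no partitions registered, so you would still run MSCK REPAIR TABLE or add the partitions manually before querying it.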
