6
votes

What I understand from the AWS Glue docs is that a crawler crawls and discovers new data. However, I noticed that after I crawled once, new data that lands in S3 is already visible when I query the Data Catalog table from Athena, for example. So can I say that I don't need to run a crawler every time new data is added, unless the schema changes?

In fact, if I know the schema of the files, I can just create the table manually and do without a crawler. Am I correct?


2 Answers

11
votes

If the data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need to run a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog before you can query them via Athena.

However, if the data is not partitioned, or it lands in partitions that are already registered, there is no need to run a crawler.

As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or you can register them manually.
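To make the two options concrete, here is a minimal sketch that only builds the Athena statements as strings. The table name `events`, the bucket `s3://my-bucket/data` and the helper names are hypothetical, and the partition values are quoted on the assumption that the partition columns are string-typed.

```python
from typing import Dict


def repair_table_sql(table: str) -> str:
    """Statement that asks Athena to scan the table location for new partitions."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition_sql(table: str, location_root: str, partition: Dict[str, str]) -> str:
    """Statement that registers one Hive-style partition manually.

    `partition` maps partition column names to values, in folder order,
    e.g. {"year": "2018", "month": "11", "day": "3"}.
    """
    # Partition spec as it appears in the DDL: year='2018', month='11', ...
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    # Matching S3 sub-folder path: year=2018/month=11/day=3
    path = "/".join(f"{key}={value}" for key, value in partition.items())
    root = location_root.rstrip("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) "
        f"LOCATION '{root}/{path}/'"
    )


if __name__ == "__main__":
    print(repair_table_sql("events"))
    print(add_partition_sql(
        "events",
        "s3://my-bucket/data",
        {"year": "2018", "month": "11", "day": "3"},
    ))
```

Either statement can be pasted into the Athena query editor. MSCK REPAIR TABLE scans the whole table location, while ALTER TABLE ADD PARTITION touches only the partition you name, so the latter is usually the lighter option when you already know which folder was added.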

The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE query in Athena, or to fill in all the fields in the AWS Glue console, you can go that way as well.
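For the manual route, a rough sketch of what such a CREATE TABLE statement could look like, again built as a plain string. The column names, the Parquet storage format, the bucket path and the `create_table_sql` helper are all assumptions for illustration; adjust them to your actual files. The partition columns are typed `int` here purely as an example of choosing your own types.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, Athena/Hive type)


def create_table_sql(
    table: str,
    columns: List[Column],
    partitions: List[Column],
    location: str,
    stored_as: str = "PARQUET",  # assumed file format
) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Athena."""
    cols = ",\n".join(f"  `{name}` {type_}" for name, type_ in columns)
    sql = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
    if partitions:
        parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
        sql += f"PARTITIONED BY ({parts})\n"
    sql += f"STORED AS {stored_as}\nLOCATION '{location}'"
    return sql


if __name__ == "__main__":
    print(create_table_sql(
        "events",
        columns=[("event_id", "string"), ("payload", "string")],
        partitions=[("year", "int"), ("month", "int"), ("day", "int")],
        location="s3://my-bucket/data/",
    ))
```

Note that a partitioned table created this way starts with no partitions registered, so the existing folders still have to be added before queries return any rows.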

1
votes

If you have the schema, you don't need the crawler, and you might get better results (for example, the crawler assumes partition columns are strings).

As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.

MSCK can time out if you've added a lot of partitions. If it does, keep running it until it completes normally.
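If you'd rather not re-run it by hand, the "keep running it" advice can be sketched as a small retry loop. This assumes a boto3 Athena client is passed in as `client`; the function name `run_until_success`, the attempt limit and the polling interval are made up for the example, and it treats any FAILED or CANCELLED run as a reason to try again.

```python
import time


def run_until_success(client, sql, database, output_location,
                      max_attempts=5, poll_seconds=2.0):
    """Re-submit an Athena statement until one run ends in SUCCEEDED.

    Returns the attempt number that succeeded; raises after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        query_id = client.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]

        # Poll until Athena reports a terminal state for this run.
        while True:
            execution = client.get_query_execution(QueryExecutionId=query_id)
            state = execution["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(poll_seconds)

        if state == "SUCCEEDED":
            return attempt
        # Otherwise fall through and submit the statement again.

    raise RuntimeError(f"{sql!r} did not succeed after {max_attempts} attempts")
```

Usage would look something like `run_until_success(boto3.client("athena"), "MSCK REPAIR TABLE events", "my_db", "s3://my-bucket/athena-results/")`, where the database name and results location are placeholders.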