1
votes

I've read the AWS Glue docs on crawlers here: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html but I'm still unclear on what exactly a Glue crawler does. Does a crawler go through your S3 buckets and create pointers to those buckets?

When the docs say "The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog", what is the purpose of these metadata tables?


2 Answers

2
votes

The crawler creates the metadata that allows Glue and services such as Athena to view the data in S3 as a database with tables. That is, it lets you populate the Glue Data Catalog.

This way, you can see the data stored in S3 as a database composed of several tables.

For example, if you want to create a crawler, you must specify the following fields:

Database --> Name of database
Service role --> service-role/AWSGlueServiceRole
Selected classifiers --> Specify Classifier
Include path --> S3 location
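The same fields can also be supplied programmatically. Below is a minimal sketch using boto3's Glue client (`create_crawler` and `start_crawler`). The helper names, the crawler name, the database name, and the bucket path are hypothetical placeholders, and the role value is simply the one from the fields above; adjust all of them to your account.

```python
def build_crawler_request(name, database, role, s3_path, classifiers=None):
    """Build the keyword arguments for Glue's create_crawler call.

    Maps the console fields onto the API parameters:
    Database -> DatabaseName, Service role -> Role,
    Selected classifiers -> Classifiers, Include path -> Targets.S3Targets[].Path
    """
    request = {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    # Classifiers are optional; without them Glue uses its built-in classifiers.
    if classifiers:
        request["Classifiers"] = list(classifiers)
    return request


def create_and_start_crawler(request):
    """Create the crawler and run it once (needs AWS credentials and boto3)."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="demo-crawler",
    database="demo_db",
    role="service-role/AWSGlueServiceRole",
    s3_path="s3://example-bucket/data/",
)
```

Calling `create_and_start_crawler(request)` would then register the crawler and trigger a run; once it finishes, the tables it discovered appear under the chosen database in the Data Catalog.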

2
votes

Crawlers analyze the data in a specified S3 location and generate or update the Glue Data Catalog, which is basically a metastore for the actual data (similar to the Hive metastore). In other words, the catalog persists information about the physical location of the data, its schema, format, and partitions, which makes it possible to query the actual data via Athena or to load it in Glue jobs.
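To make "metadata tables" concrete, here is a rough sketch that reads the catalog entries a crawler produced and pulls out exactly those pieces: location, format, schema, and partitions. It assumes boto3 and the response shape of Glue's `get_tables` call; the function names and the database name you pass in are hypothetical.

```python
def summarize_table(table):
    """Extract location, format, schema, and partition keys from one
    table entry as returned in the 'TableList' of Glue's get_tables."""
    storage = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        # Pointer to where the data physically lives (e.g. an s3:// path).
        "location": storage.get("Location"),
        "input_format": storage.get("InputFormat"),
        # Column names and types inferred by the crawler.
        "columns": [(c["Name"], c.get("Type")) for c in storage.get("Columns", [])],
        "partition_keys": [
            (p["Name"], p.get("Type")) for p in table.get("PartitionKeys", [])
        ],
    }


def list_catalog_tables(database):
    """Yield a summary for every table in a catalog database
    (needs AWS credentials and boto3)."""
    import boto3  # imported here so summarize_table works without boto3

    glue = boto3.client("glue")
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            yield summarize_table(table)
```

Note that the catalog holds no copy of the data itself: each entry only describes where the files are and how to read them, which is what Athena and Glue jobs rely on.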

I would suggest reading this documentation to understand Glue crawlers better and, of course, running some experiments.