0 votes

I have a couple of tables in my S3 bucket. The tables are big, both in size and in the number of files; they are stored as JSON (suboptimal, I know) and have a lot of partitions.

Now I want to enable the AWS Glue Data Catalog and AWS Glue Crawlers; however, I am terrified by the price of the crawlers going through all of the data.

The schema doesn't change often, so it is not necessary to go through all of the files on S3.

Will the Crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that would look inside just some of the files instead of all of them?

Are you using the Glue crawler for schema change detection, or only for new partition detection for newly added datasets? If it is only for new partition detection after the first crawler run has detected the schema, you can use the Athena Boto3 APIs to add partitions without running the crawler at all (see the sketch after these comments). - Sandeep Fatangare
This is a nice trick; I have used it in the past. But schema changes do happen every so often. - dmigo
Then it is going to be tricky, as you need to make sure all schema changes are taken into account when the crawler runs; otherwise the crawler will not detect all schema changes, which in turn makes the Glue Data Catalog messy. If you wish to do selective crawling, as stated by @Eman, you can use exclude paths (unfortunately Glue doesn't provide include paths :( ). But while doing so, you must include every path that may have schema changes, e.g. if schema changes happened on the 10th, 15th, and 20th of Oct 2019, then those paths must be included in the crawler path and the crawler must crawl over those datasets every time. continue... 1/2 - Sandeep Fatangare
To be honest, that nullifies the purpose of the crawler, as you have to know about schema changes explicitly. ... 2/2 - Sandeep Fatangare
Ideally I would imagine the crawler going through 5% of randomly selected files. That would cover all the changes with relatively high probability while reducing the cost of scanning significantly. - dmigo
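For the partition-registration trick mentioned in the comments above, a minimal sketch with Boto3 might look like the following. The database, table, partition key (`dt`), bucket paths, and results location are placeholder assumptions; adapt them to your own layout.

```python
import boto3

# Hypothetical names -- replace with your own database, table, and buckets.
DATABASE = "my_database"
TABLE = "my_table"
RESULTS_S3 = "s3://my-athena-query-results/"

athena = boto3.client("athena")

def add_partition(dt: str, s3_prefix: str) -> str:
    """Register a single new partition without running a crawler.

    Uses Athena's ALTER TABLE ... ADD IF NOT EXISTS PARTITION, so it only
    updates catalog metadata and never scans the underlying JSON files.
    """
    query = (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}') LOCATION '{s3_prefix}'"
    )
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULTS_S3},
    )
    return response["QueryExecutionId"]

# Example: register the partition for a newly arrived day of data.
add_partition("2019-10-20", "s3://my-bucket/my_table/dt=2019-10-20/")
```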

1 Answer

1 vote

Depending on your bucket structure, maybe you could just make use of exclude paths and point the crawlers to the specific prefixes that you want crawled. If the partitioning is Hive-style, you can use Athena to execute MSCK REPAIR TABLE to add partitions. Alternatively, you can create the tables manually in Athena and then run MSCK REPAIR TABLE, which is bound to take a very long time if you have too many partitions and the files are huge, as you mentioned. A sketch of both approaches follows.
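A minimal Boto3 sketch of the two approaches, assuming placeholder names for the crawler, IAM role, database, table, and buckets; the exclude patterns are illustrative globs (evaluated relative to the include path), not anything specific to this question.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Hypothetical names -- replace with your own role, database, table, and buckets.
CRAWLER_NAME = "my-json-crawler"
GLUE_ROLE = "arn:aws:iam::123456789012:role/MyGlueServiceRole"
DATABASE = "my_database"
TABLE = "my_table"
RESULTS_S3 = "s3://my-athena-query-results/"

# 1) A crawler limited to one prefix, with exclude patterns so it skips
#    partitions you already know haven't changed.
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE,
    DatabaseName=DATABASE,
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/my_table/",
                "Exclusions": ["dt=2018-*/**", "dt=2019-0*/**"],
            }
        ]
    },
)

# 2) For Hive-style layouts (e.g. .../dt=2019-10-20/...), MSCK REPAIR TABLE
#    run through Athena discovers all partitions under the table's location.
athena.start_query_execution(
    QueryString=f"MSCK REPAIR TABLE {TABLE}",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS_S3},
)
```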