AWS Glue does not detect partitions and creates 1000+ tables in catalog

Question

I am using AWS Glue to create metadata tables.

AWS Glue Crawler data store path: s3://bucket-name/

Bucket structure in S3 is like

├── bucket-name
│   ├── pt=2011-10-11-01
│   │   ├── file1
│   │   ├── file2
│   ├── pt=2011-10-11-02
│   │   ├── file1
│   ├── pt=2011-10-10-01
│   │   ├── file1
│   ├── pt=2011-10-11-10
│   │   ├── file1

For this layout, the AWS Glue crawler creates 4 tables.

My question is: why does the AWS Glue crawler not detect the pt= folders as partitions?

bhrd bhrd · Accepted Answer · 2019-05-04T19:34:11

To force Glue to merge multiple schemas together, make sure the option "Create a single schema for each S3 path" is checked when creating the crawler.

[Screenshot: crawler creation step with this setting enabled]

Here is a detailed explanation, quoted directly from the AWS documentation:

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors taken into account include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.

You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
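If you create the crawler from code rather than the console, the same option can be set through the crawler's Configuration JSON. Below is a minimal sketch using boto3, not a definitive setup: the function names and the arguments (crawler name, role ARN, database, S3 path) are placeholders I made up for illustration, and the call assumes your AWS credentials and region are already configured.

```python
import json


def build_crawler_configuration() -> str:
    """Return the crawler Configuration JSON that sets the grouping policy."""
    return json.dumps(
        {
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }
    )


def create_single_table_crawler(name, role_arn, database, s3_path):
    """Create a crawler that combines compatible schemas under s3_path.

    All four arguments are hypothetical placeholders; pass your own values.
    """
    import boto3  # imported here so the config helper works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Configuration=build_crawler_configuration(),
    )
```

For the bucket in the question, s3_path would be s3://bucket-name/. As the quoted documentation notes, the crawler still checks data compatibility (format, compression, path structure), so files that differ in those respects may still end up in separate tables.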

