8
votes

I am using AWS Glue to create metadata tables.

AWS Glue Crawler data store path: s3://bucket-name/

The bucket structure in S3 looks like this:

├── bucket-name
│   ├── pt=2011-10-11-01
│   │   ├── file1
│   │   ├── file2
│   ├── pt=2011-10-11-02
│   │   ├── file1
│   ├── pt=2011-10-10-01
│   │   ├── file1
│   ├── pt=2011-10-11-10
│   │   ├── file1

                       

For this structure, the AWS Glue crawler creates 4 tables.

My question is: why does the AWS Glue crawler not detect the partitions?

5 Answers

7
votes

To force Glue to merge multiple schemas together, make sure the option "Create a single schema for each S3 path" is checked when creating the crawler.

[Screenshot: crawler creation step with this setting enabled]

Here is a detailed explanation, quoted directly from the AWS documentation (reference):

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors taken into account include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.

You can configure a crawler to CombineCompatibleSchemas into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

If you are configuring the crawler on the console, to combine schemas, select the crawler option Create a single schema for each S3 path.
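For a crawler that already exists, the same option can presumably be set outside the console. The following is a minimal sketch using boto3's `update_crawler`; the helper names `combine_schemas_configuration` and `enable_single_schema` are made up for illustration, and the crawler name is whatever yours is called.

```python
import json


def combine_schemas_configuration() -> str:
    """Return the crawler Configuration JSON that requests CombineCompatibleSchemas."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


def enable_single_schema(crawler_name: str) -> None:
    """Apply the grouping policy to an existing crawler (hypothetical helper)."""
    # boto3 is imported here so the pure helper above works without AWS libraries.
    import boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Configuration=combine_schemas_configuration(),
    )
```

Calling `enable_single_schema("my-crawler")` requires valid AWS credentials; re-run the crawler afterwards for the setting to take effect.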

3
votes

You need to crawl a parent folder that has all the partitions under it; otherwise the crawler will treat each partition as a separate table. For example, lay out the data like this:

s3://bucket/table/part=1
s3://bucket/table/part=2
s3://bucket/table/part=3

then crawl s3://bucket/table/
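Applied to the layout in the question, this means moving every `pt=...` folder one level down, under a table folder. Below is a small sketch of the key mapping only; `relocate_under_table` is a hypothetical helper, the folder name `table` is taken from the example above, and the actual copy or move in S3 is left out.

```python
def relocate_under_table(key: str, table: str) -> str:
    """Prefix an object key with a table folder, e.g. 'pt=x/file1' -> 'table/pt=x/file1'."""
    return f"{table.strip('/')}/{key.lstrip('/')}"


# Object keys as they appear in the question's bucket.
old_keys = [
    "pt=2011-10-11-01/file1",
    "pt=2011-10-11-01/file2",
    "pt=2011-10-11-02/file1",
    "pt=2011-10-10-01/file1",
    "pt=2011-10-11-10/file1",
]

new_keys = [relocate_under_table(key, "table") for key in old_keys]

# The crawler's include path then becomes the table folder, not the bucket root.
crawl_path = "s3://bucket-name/table/"
```

With this layout all five objects sit under one parent folder, so the `pt=` folders can be picked up as partitions of a single table.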

1
votes

Answer is:

Before merging schemas, the AWS Glue crawler first computes a similarity index for them. If the similarity index is more than 70%, the schemas are merged; otherwise a new table is created.
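How Glue actually computes its similarity index is not stated here. Purely to illustrate what a 70% threshold rule looks like, the sketch below assumes a Jaccard-style overlap of (column name, type) pairs; both `schema_similarity` and `should_merge` are hypothetical and are not Glue's real metric.

```python
def schema_similarity(a: dict[str, str], b: dict[str, str]) -> float:
    """Jaccard overlap of (column, type) pairs. An assumed metric, not Glue's own."""
    fields_a, fields_b = set(a.items()), set(b.items())
    union = fields_a | fields_b
    if not union:
        return 1.0  # two empty schemas are treated as identical
    return len(fields_a & fields_b) / len(union)


def should_merge(a: dict[str, str], b: dict[str, str], threshold: float = 0.70) -> bool:
    """Merge only when the similarity is strictly above the threshold."""
    return schema_similarity(a, b) > threshold


base = {"id": "bigint", "name": "string", "ts": "timestamp", "value": "double"}
extra_column = {**base, "comment": "string"}    # 4 shared fields out of 5 in total
unrelated = {"id": "bigint", "other": "string"}  # 1 shared field out of 5 in total
```

Under this assumed metric, `base` and `extra_column` would be merged into one table, while `base` and `unrelated` would end up as separate tables.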

0
votes

Try using a table path like s3://bucket-name/<table_name>/pt=<date_time>/file. If the crawler still treats every partition as a separate table after that, try creating the table manually and then re-running the crawler to bring in the partitions.

0
votes

There are two things I needed to do to get AWS Glue to avoid creating extraneous tables. This was tested with boto3 1.17.46.

Firstly, ensure an S3 object structure such as this:

s3://mybucket/myprefix/mytable1/<nested_partition>/<name>.xyz
s3://mybucket/myprefix/mytable2/<nested_partition>/<name>.xyz
s3://mybucket/myprefix/mytable3/<nested_partition>/<name>.xyz

Secondly, if using boto3, create the crawler with the arguments:

import json
import boto3

targets = [{"Path": f"s3://mybucket/myprefix/mytable{i}/"} for i in (1, 2, 3)]
config = {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}

# Name and Role (both required by create_crawler) are omitted here for brevity.
boto3.client("glue").create_crawler(Targets={"S3Targets": targets}, Configuration=json.dumps(config))
  • As per Targets, each table's path is provided as a list to the crawler.
  • As per Configuration, all files under each provided path should be merged into a single schema.

If using something other than boto3, it should be straightforward to provide the aforementioned arguments similarly.