Glue crawler created multiple tables from a partitioned S3 bucket

Question

I have an S3 bucket that is structured like this:

root/
├── year=2020/
│   └── month=01/
│       ├── day=01/
│       │   ├── file1.log
│       │   ├── ...
│       │   └── file8.log
│       ├── day=...
│       └── day=31/
│           ├── file1.log
│           ├── ...
│           └── file8.log
└── year=2019/
    └── ...

Each day has 8 files, and the file names are identical across days: there is a file1.log in every 'day' folder. I crawled this bucket using a custom classifier.

Expected behavior: Glue creates a single table with year, month, and day as partition fields, plus the other fields I described in my custom classifier. I can then use the table in my Job scripts.

Actual behavior:

1) Glue created one table that matched my expectations. However, when I tried to access it in Job scripts, the table had no columns.

2) Glue created one table for every 'day' partition, and 8 tables for the file<number>.log files.

I have tried excluding **_SUCCESS and **crc, as suggested in this other question: "AWS Glue Crawler adding tables for every partition?". However, that doesn't seem to help. I have also checked the 'Create a single schema for each S3 path' option in the crawler's settings. It still doesn't work.

What am I missing?

Sandeep Fatangare · Accepted Answer · 2020-01-16T06:39:52

You should have one folder at the root of the bucket (e.g. customers), with the partition sub-folders inside it. If the partitions sit directly at the S3 bucket level, the crawler will not create a single table.
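To make the suggested layout concrete, here is a minimal sketch in Python. It assumes the table folder is called customers (the example name above). The bucket name, crawler name, role ARN, database, and classifier name are all hypothetical placeholders. The first helper shows where an existing key would end up under the new folder. The second builds the arguments for boto3's Glue create_crawler call, pointing the crawler at that folder instead of the bucket root. The Grouping setting is, as far as I know, the API counterpart of the 'Create a single schema for each S3 path' console option.

```python
import json


def prefixed_key(key: str, table_prefix: str = "customers") -> str:
    """Return the object key relocated under a single table-level folder."""
    return f"{table_prefix.strip('/')}/{key.lstrip('/')}"


def crawler_request(bucket, table_prefix, name, role_arn, database, classifier):
    """Build keyword arguments for glue.create_crawler.

    All argument values are supplied by the caller; none come from the question.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Classifiers": [classifier],
        # Crawl the table folder, not the bucket root.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{table_prefix}/"}]},
        # Ask the crawler to combine compatible schemas into one table.
        "Configuration": json.dumps(
            {"Version": 1.0,
             "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
        ),
    }


def create_crawler(request):
    """Send the request to AWS; needs boto3 and valid credentials."""
    import boto3  # imported here so the helpers above work without boto3

    boto3.client("glue").create_crawler(**request)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(prefixed_key("year=2020/month=01/day=01/file1.log"))
    print(crawler_request("my-bucket", "customers", "logs-crawler",
                          "arn:aws:iam::123456789012:role/MyGlueRole",
                          "logs_db", "my-custom-classifier"))
```

Note that S3 has no rename operation, so moving the existing objects under the new folder means copying each one to its new key and deleting the original. That step is not shown here.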
