I have an S3 bucket that is structured like this:
root/
├── year=2020/
│   └── month=01
│       ├── day=01
│       │   ├── file1.log
│       │   ├── ...
│       │   └── file8.log
│       ├── day=...
│       └── day=31
│           ├── file1.log
│           ├── ...
│           └── file8.log
└── year=2019/
    ├── ...
Each day has 8 files, and the file names are identical across days: there is a file1.log in every 'day' folder. I crawled this bucket using a custom classifier.
Expected behavior: Glue creates a single table with year, month, and day as partition fields, plus the other fields that I described in my custom classifier. I can then use that table in my Job scripts.
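To make the expected partition fields concrete, here is a minimal sketch of how I understand the Hive-style `name=value` folders to map to partition values. The helper `partition_values` is a name I made up for illustration; it is not part of Glue, and it only mimics what I expect the crawler to infer from the path.

```python
def partition_values(key: str) -> dict:
    """Collect name=value path segments from an S3 object key.

    Illustrative helper only (hypothetical name, not a Glue API).
    """
    parts = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:  # keep only segments that contain '='
            parts[name] = value
    return parts


print(partition_values("root/year=2020/month=01/day=01/file1.log"))
# {'year': '2020', 'month': '01', 'day': '01'}
```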
Actual behavior:
1) Glue created one table that matched my expectations. However, when I tried to access it in Job scripts, the table had no columns.
2) Glue created one table for every 'day' partition, and 8 tables for the file<number>.log files.
I have tried excluding **_SUCCESS and **crc, as people suggested in this other question: AWS Glue Crawler adding tables for every partition? However, it doesn't seem to work. I have also checked the 'Create a single schema for each S3 path' option in the crawler's settings. It still doesn't work.
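For reference, this is roughly how I believe those crawler settings look when expressed as arguments for boto3's `update_crawler`. It is a sketch, not my exact setup: the crawler name, role, database, bucket path, and classifier name are all placeholders I invented here, and as far as I can tell the console checkbox corresponds to the `CombineCompatibleSchemas` grouping policy.

```python
import json


def crawler_settings(name, role, database, s3_path, classifier):
    """Build keyword arguments for the Glue update_crawler call."""
    configuration = {
        "Version": 1.0,
        # Assumed equivalent of 'Create a single schema for each S3 path'
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Classifiers": [classifier],
        "Targets": {
            "S3Targets": [
                {"Path": s3_path, "Exclusions": ["**_SUCCESS", "**crc"]}
            ]
        },
        # Glue expects the configuration as a JSON string
        "Configuration": json.dumps(configuration),
    }


# All values below are placeholders, not my real resources.
settings = crawler_settings(
    name="my-log-crawler",
    role="my-glue-role",
    database="logs_db",
    s3_path="s3://my-bucket/root/",
    classifier="my-log-classifier",
)
print(settings["Configuration"])

# Applying it needs AWS credentials, so it is left commented out:
# import boto3
# boto3.client("glue").update_crawler(**settings)
```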
What am I missing?