I have a pretty basic S3 setup that I would like to query using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.
|--data
| |--2018
| | |--01
| | | |--01
| | | | |--01
| | | | | |--file1.json
| | | | | |--file2.json
| | | | |--02
| | | | | |--file3.json
| | | | | |--file4.json
...
I then set up an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I would expect to get one database table, with partitions on the year, month, day, etc.
What I get instead are tens of thousands of tables: there is a table for each file, and a table for each parent partition as well. As far as I can tell, separate tables were created for every file and folder, with no single overarching table I can query across a large date range.
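For reference, the kind of single-table query I am hoping to be able to run looks roughly like the sketch below. The table name, database name, output bucket, and partition column names are all placeholders, since I don't yet know what the crawler would call them.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical query against the single partitioned table I expected the crawler
# to create; "data", "my_glue_database", and the results bucket are placeholders.
query = """
SELECT count(*)
FROM data
WHERE year = '2018'
  AND month = '01'
  AND day BETWEEN '01' AND '07'
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
```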
I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning so that I don't end up with this huge, mostly worthless dump of tables.
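The part of that page that looks relevant to me is the option to create a single schema for each S3 include path, which the docs express as a TableGroupingPolicy of CombineCompatibleSchemas in the crawler's Configuration JSON. A sketch of how I believe it would be applied to an existing crawler (the crawler name is a placeholder, and I haven't confirmed this fixes my case):

```python
import json
import boto3

glue = boto3.client("glue")

# Configuration JSON from the crawler-configuration page: group compatible
# schemas under the include path into one table instead of one table per file/folder.
configuration = {
    "Version": 1.0,
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
}

# "my-data-crawler" is a placeholder for the existing crawler's name.
glue.update_crawler(
    Name="my-data-crawler",
    Configuration=json.dumps(configuration),
)
```

My (unverified) understanding is that the crawler would then create one table rooted at s3://bucket/data and record the unnamed folder levels as generic partition columns (partition_0, partition_1, ...), but I'm not sure that is the whole story.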
Have you tried naming the partition folders like year=2018/month=01/day=01? What do your JSON files look like? – Yuriy Bondaruk

{"x":"text","y":"text","z":"text"}. I have not tried naming partitions; would that cut down on the actual number of tables/partitions created? Can you name partitions inline like you wrote when configuring the crawler? And no, my data already exists as the output of a live data pipeline; I have not reorganized it. The folder structure is deliberate and I am not to mess with it. – zachd1_618
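One fallback I'm aware of (not suggested in the comments, just my understanding of Athena DDL) is to declare a single table by hand over the existing layout and register each hourly folder as a partition, since the folders are bare values rather than key=value pairs. A rough sketch, with the column names taken from the JSON above and the table, database, and results bucket as placeholders:

```python
import boto3

athena = boto3.client("athena")

def run_ddl(statement: str) -> None:
    """Submit a DDL statement to Athena; database and output bucket are placeholders."""
    athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": "my_glue_database"},
        ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
    )

# One table over the whole prefix, with partition keys I name myself
# (the folders themselves are not in key=value form, so nothing is inferred).
run_ddl("""
CREATE EXTERNAL TABLE IF NOT EXISTS json_data (
  x string,
  y string,
  z string
)
PARTITIONED BY (`year` string, `month` string, `day` string, `hour` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/data/'
""")

# Because the folders are not named year=.../month=..., each partition has to be
# registered explicitly and pointed at its folder; one example hour shown here.
run_ddl("""
ALTER TABLE json_data ADD IF NOT EXISTS
PARTITION (`year` = '2018', `month` = '01', `day` = '01', `hour` = '01')
LOCATION 's3://bucket/data/2018/01/01/01/'
""")
```

That would let a WHERE year = '2018' AND month = '01' query touch only the matching folders, but with hourly folders it means a very large number of ALTER TABLE statements, which is why I'd much rather get the crawler to do this for me.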