8
votes

I have a pretty basic S3 setup that I would like to query using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...

I then set up an AWS Glue crawler to crawl s3://bucket/data. The schema in all files is identical. I would expect to get one database table, with partitions on year, month, day, etc.

What I get instead are tens of thousands of tables: there is a table for each file, and a table for each parent partition as well. As far as I can tell, separate tables were created for each file/folder, with no single overarching table I can query across a large date range.

I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning so that I don't end up with this huge, mostly useless dump of tables.

2
Did you try putting the files in a single folder without partitioning and crawling that? Did you try naming the partitions year=2018/month=01/day=01? What do your JSON files look like? – Yuriy Bondaruk
My JSON is pretty simple: {"x":"text","y":"text","z":"text"}. I have not tried naming partitions; would that cut down on the number of tables/partitions created? Can you name partitions inline like you wrote when configuring the crawler? And no, my data already exists as the output of a live data pipeline; I have not reorganized it. The folder structure is deliberate and I am not free to mess with it. – zachd1_618
There is no way to define partitions inline in the crawler. Naming folders '<column-name>=<column-value>' is just a more convenient way to work with partitions, since the columns will be named properly in the table instead of 'partition_0', 'partition_1', etc. I'm not sure whether that affects the crawler, though. Your file looks OK; I had an issue where a file contained an array of JSON objects, and it was fixed by adding a custom classifier. – Yuriy Bondaruk
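For reference, the custom classifier fix mentioned above can be created through the Glue API; a minimal boto3 sketch (the classifier name is a placeholder):

import boto3

glue = boto3.client("glue")

# Placeholder classifier name. The JSONPath "$[*]" tells the crawler to
# treat each element of a top-level JSON array as a separate record.
glue.create_classifier(
    JsonClassifier={
        "Name": "json-array-classifier",
        "JsonPath": "$[*]",
    }
)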

2 Answers

1
votes

Glue Crawler leaves a lot to be desired. It promises to handle a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json), it will more often than not mess up. This is especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which it looks like your data could be.

Depending on how much data you have, I might start by just creating an unpartitioned Athena table that points to the root of the structure. Partitioning only becomes important once your data grows beyond multiple gigabytes or thousands of files.
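As a minimal sketch of that table, using boto3 and the x/y/z schema from your comment (the database name, table name, and query-results location are placeholders):

import boto3

athena = boto3.client("athena")

# Unpartitioned table over the bucket root; Athena will scan all files
# under the LOCATION prefix on every query.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.events (
  x string,
  y string,
  z string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/data/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)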

Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key, figure out which partition it belongs to, and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API, but as long as your function catches and ignores it, you will be fine.
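A minimal sketch of such a handler, assuming a table that already exists with year/month/day/hour partition keys (the database and table names, and the storage descriptor details, are placeholders to adjust to your actual table):

import boto3

glue = boto3.client("glue")

# Placeholder names for the target database and table.
DATABASE = "my_db"
TABLE = "events"

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. data/2018/01/01/01/file1.json
        # Key layout is data/<year>/<month>/<day>/<hour>/<file>
        year, month, day, hour = key.split("/")[1:5]
        try:
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInput={
                    "Values": [year, month, day, hour],
                    "StorageDescriptor": {
                        "Location": f"s3://{bucket}/data/{year}/{month}/{day}/{hour}/",
                        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                        "SerdeInfo": {
                            "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                        },
                    },
                },
            )
        except glue.exceptions.AlreadyExistsException:
            pass  # partition already registered; safe to ignore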

0
votes

In my experience, files containing just one record often end up as separate tables. I tried files with more than two records and was able to group everything under one table with the respective partitions.

What do your JSON files look like?