How to improve query performance to s3 data from Athena

Question

I have partitioned data stored in S3 in hive format like this.

bucket/year=2017/month=3/date=1/filename.json
bucket/year=2017/month=3/date=2/filename1.json
bucket/year=2017/month=3/date=3/filename2.json

Every partition has around 1,000,000 records. I have created table and partitions in Athena for this.

Now running query from Athena

select count(*) from mts_data_1 where year='2017' and month='3' and date='1'

this query is taking 1800 seconds to scan 1,000,000 records.

So my question is how can I improve this query performance?

How many files and bytes of data is Athena scanning in that query? — James
It scanning around 1000k file (around 250MB data). Every file has one JSON record. — Shailendra

James James · Accepted Answer · 2017-03-29T04:20:39

I think the problem is that Athena has to read so many files from S3. 250 MB isn't so much data, but 1,000,000 files is a lot of files. Athena query performance will improve dramatically if you reduce the number of files, and compressing the aggregated files will help some more. How many files do you need for one day's partition? Even with one-minute resolution, you would need less than 1,500 files. If the current query time is ~30 minutes, you might easily start with a lot less.

There are many options for aggregating and compressing your records:

AWS's Kinesis Firehose is a fairly simple way to start on exactly this sort of problem.
A streaming data processing tool like Apache NiFi would offer a richer set of tranformation, aggregation, and compression options. I've written a blog post about using Apache NiFi to stream data to S3 for Athena, covering these same issues.

How to improve query performance to s3 data from Athena

1 Answers