0
votes

I have a tsv file in s3 and I am evaluating Athena query scanned size with and without partitioning. The query is running fine without any partition and scanning full data. When I am trying to partition, I am not able to load the partition. The docs weren't very helpful in understanding how to load partition for tsv data.

S3 data example:

column1 column2 US  column4
column1 column2 US  column4
column1 column2 DE  column4
column1 column2 DE  column4
column1 column2 US  column4
column1 column2 US  column4
column1 column2 IT  column4
column1 column2 IT  column4

I want to partition by the 3rd column since this is the one I will be query a lot by. Running MSCK REPAIR TABLE apparently didn't load the partitions. How to achieve this?

1
Good article explaining the benefits of partitioning with Amazon Athena: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog - John Rotenstein

1 Answers

1
votes

In Athena, partitions need to be separated into S3 folders. Partitioning by arbitrary data columns is not supported.

So to use partitions you can either set the neccessary column as name/value pair as folder name or explicitly map folders into partitions.

Both scenarios are explained here:

https://docs.aws.amazon.com/athena/latest/ug/partitions.html