I have defined a partitioned table that points to an S3 bucket which uses date partitioning. The bucket holds data for the past 3 months. I have loaded the partitions for the 1st month, but I haven't loaded the partitions for the past 2 months using the msck repair table or alter table commands. When I query the table, data for the past 2 months is not loaded from S3; only the most recent partitioned data shows up in the query results. Is this expected? If so, why?

I also tried creating another partitioned table over the same S3 bucket, but this time I did not load any of the partitions. When I query this table, I get the most recent records.

1 Answer

Yes, it is expected.

Athena uses metadata to find data in S3. The most important piece of that metadata is the partition list: Athena stores the details of every registered partition and uses that information to reach the corresponding folder in S3 and fetch the data.

  1. If you add more files to an existing partition: if the partition is already in Athena's metadata, the new files are detected automatically, because Athena reads every file under the partition's S3 location at query time.
  2. If you add files under a new partition: if the partition is not in Athena's metadata, Athena doesn't know how to locate the corresponding folder in S3, so it doesn't read data from that folder.
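
The two cases above can be illustrated with a toy model in Python. This is only a sketch of the idea, not how Athena is implemented: the bucket name, the `dt=` folder layout, and the `visible_objects` helper are all made up for the example.

```python
# Toy model only: all bucket, folder, and function names are hypothetical.

# Objects that exist in S3.
s3_objects = [
    "s3://my-bucket/logs/dt=2019-01-01/part-0.csv",
    "s3://my-bucket/logs/dt=2019-01-01/part-1.csv",  # new file, existing partition
    "s3://my-bucket/logs/dt=2019-02-01/part-0.csv",  # partition never registered
]

# Partitions registered in the table metadata: partition value -> S3 prefix.
registered_partitions = {
    "2019-01-01": "s3://my-bucket/logs/dt=2019-01-01/",
}


def visible_objects(partitions, objects):
    """Return the objects a query would scan: only those under registered prefixes."""
    prefixes = tuple(partitions.values())
    return sorted(key for key in objects if key.startswith(prefixes))


# Case 1: both files in the registered partition are scanned.
# Case 2: the file under dt=2019-02-01 is skipped until its partition is registered.
scanned_before = visible_objects(registered_partitions, s3_objects)

registered_partitions["2019-02-01"] = "s3://my-bucket/logs/dt=2019-02-01/"
scanned_after = visible_objects(registered_partitions, s3_objects)
```

Here `scanned_before` contains only the two January files, while `scanned_after` contains all three once the second partition has been added to the metadata.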

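There are three ways to register new partitions:

  1. Run a Glue crawler over the S3 bucket; it refreshes the partition metadata.
  2. Use the ALTER TABLE ADD PARTITION command in Athena to add the new partitions.
  3. Run MSCK REPAIR TABLE, which scans the table's S3 location and adds any partitions missing from the metadata. This only works when the folders use Hive-style names (key=value, for example dt=2019-01-01/).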
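
As a rough sketch of the ALTER TABLE and MSCK REPAIR TABLE options, the Python below only builds the statements as strings; you would still run them yourself, for example in the Athena query editor. The table name `logs`, the partition column `dt`, the bucket path, and the helper names are assumptions made for illustration.

```python
from datetime import date, timedelta


def add_partition_sql(table, column, day, base_location):
    """Build an ALTER TABLE statement that registers one daily partition."""
    value = day.isoformat()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column} = '{value}') "
        f"LOCATION '{base_location}{column}={value}/'"
    )


def repair_sql(table):
    """Build the statement that scans for Hive-style partition folders."""
    return f"MSCK REPAIR TABLE {table}"


def days_between(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Hypothetical table and location; adjust to your own layout.
statements = [
    add_partition_sql("logs", "dt", day, "s3://my-bucket/logs/")
    for day in days_between(date(2019, 2, 1), date(2019, 2, 3))
]
```

If the folders already follow the `dt=YYYY-MM-DD/` layout, a single `repair_sql("logs")` statement should achieve the same result without listing each partition, at the cost of scanning the whole table location.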