I have a DynamoDB table where I'm aggregating CDN access logs. Specifically I want to track:
For a given customer (all of whose requests can be identified from the URL being downloaded), how many bytes were delivered on their behalf each day?
I have a primary partition key on customer and a primary sort key on time_bucket (day). This way given a customer I can say "find all records from March 1st, 2021 to March 31st, 2021" for instance. So far, so good
The issue arose when I wanted to start deleting old data. Anything older than 5 years should be dropped from the database.
Because the partition key isn't on time_bucket, there's no easy way to say "retrieve all records for May 25th, 2016". Doing so requires a scan instead of a query, and scans are out of the question (unusably slow given how much data I'm handling)
I don't want to swap the partition key and sort key for two reasons:
- When processing new data to add to the Dynamo table, all new CDN logs will be for the same day. This means that my table will be unbalanced: every write operation made during a single day will hit the same partition key
- If I wanted to pull a month's worth of data for a single customer I would have to make 30 queries -- one for each day of the month. This gets even worse when pulling a year of data, or 3 years of data
My first thought was "just add an index on the time_bucket column", but when I tried this I got an error:
Attribute Name is duplicated: time_bucket (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: PAN9FVSEMBBJT412NCV013VURNVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
It seems like DynamoDB does not allow you to create an index on the sort key. So what's the proper solution here?