1 vote

I feel like I'm thinking myself in circles here. Maybe you all can help :)

Say I have this simple table design in DynamoDB:

Id | Employee | Created | SomeExtraMetadataColumns... | LastUpdated  

Say my only use case is to find all the rows in this table where LastUpdated < (now - 2 hours).

Assume that 99% of the data in the table will not meet this criterion. Assume there is some job running every 15 minutes that updates the LastUpdated column.

Assume there are, say, 100,000 rows and the table grows by maybe 1,000 rows a day (no need for large write capacity).

Assume a single entity will be performing this 'read' use case (no need for large read capacity).

Options I can think of:

  • Do a scan.

Pro: can leverage parallel scans to scale in the future.

Con: wastes a lot of money reading rows that do not match the filter criteria.

  • Add a new column called 'Constant' that would always have the value of 'Foo' and make a GSI with the Partition Key of 'Constant' and a Sort Key of LastUpdated. Then execute a query on this index for Constant = 'Foo' and LastUpdated < (now - 2 hours), roughly as sketched below the options.

Pro: Only queries the rows matching the filter. No wasted money.

Con: In theory this would be plagued by the 'hot partition' problem if writes scale up. But I am unsure how much of a problem that would actually be, as AWS has described hot partitions as largely a thing of the past.
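To make the GSI option concrete, here is roughly what I have in mind with boto3. The table and index names are placeholders, and I'm assuming LastUpdated is stored as a UTC ISO-8601 string (e.g. 2020-10-01T00:00:00Z) so it sorts lexicographically:

# Rough sketch of option 2: query a GSI whose partition key is the constant
# attribute and whose sort key is LastUpdated. Names are placeholders.
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Employees")  # placeholder table name

# Rows not updated in the last two hours, assuming ISO-8601 UTC timestamps.
cutoff = (datetime.now(timezone.utc) - timedelta(hours=2)).strftime("%Y-%m-%dT%H:%M:%SZ")
condition = Key("Constant").eq("Foo") & Key("LastUpdated").lt(cutoff)

response = table.query(
    IndexName="Constant-LastUpdated-index",  # placeholder GSI name
    KeyConditionExpression=condition,
)
items = response["Items"]

# Page through if the result set exceeds 1 MB per query call.
while "LastEvaluatedKey" in response:
    response = table.query(
        IndexName="Constant-LastUpdated-index",
        KeyConditionExpression=condition,
        ExclusiveStartKey=response["LastEvaluatedKey"],
    )
    items.extend(response["Items"])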

Honestly, I'm leaning toward the latter option. But I'm curious what the community's thoughts are on this. Perhaps I am missing something.

Before anyone can answer this question, please list ALL of your access patterns. It may be that your current partition / indexing is incorrect or could be adjusted with different indexes or sparse indexing. – Alex Bailey

Hey Alex, good call out. That is my only access pattern for this table. I updated my question for clarity. – mBrice1024

2 Answers

0 votes

Based on the assumption that the last_updated field is the only field you need to query against, I would do something like this:

Employee item:
PK: EMPLOYEE::{emp_id}
SK: LastUpdated
Attributes: Employee, ..., Created

Update item:
PK: EMPLOYEE::UPDATE
SK: LastUpdated::{emp_id}
Attributes: Employee, ..., Created

By denormalising your data here, you can write an additional 'update' row for each employee, which can then be queried with PK = EMPLOYEE::UPDATE and SK between 'datetime' and 'datetime'. This assumes you store the datetime as something like 2020-10-01T00:00:00Z.
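As a rough boto3 sketch of that query (the table name and the literal datetime window are placeholders, and it assumes the SK is the ISO-8601 timestamp followed by ::emp_id as above):

# Minimal sketch of querying the denormalised 'update' rows.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("EmployeeTable")  # placeholder table name

# The emp_id suffix on the SK still sorts correctly because the ISO-8601
# timestamp comes first.
response = table.query(
    KeyConditionExpression=(
        Key("PK").eq("EMPLOYEE::UPDATE")
        & Key("SK").between("2020-10-01T00:00:00Z", "2020-10-01T02:00:00Z")
    )
)
for item in response["Items"]:
    print(item["SK"])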

You can either insert this additional row at the same time as the main item, or you could consider utilising DynamoDB Streams to stream update events to Lambda and add the row from there. You can set a TTL on the 'update' row, which DynamoDB will delete somewhere between 0 and 48 hours after the TTL you set, keeping the table clean. It doesn't need to be removed instantly because you're querying on the PK and SK anyway.
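If you go the Streams route, a hypothetical Lambda handler might look something like the sketch below. The table name, the TTL attribute name (expires_at) and the 24-hour retention are all placeholders, and it assumes the stream is configured with a view type that includes the new image:

# Hypothetical DynamoDB Streams handler that writes the 'update' row with a TTL.
# Table name, attribute names and the retention period are assumptions.
import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("EmployeeTable")  # placeholder table name


def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue  # ignore deletes, including TTL expirations

        new_image = record["dynamodb"]["NewImage"]
        pk = new_image["PK"]["S"]
        if not pk.startswith("EMPLOYEE::") or pk == "EMPLOYEE::UPDATE":
            continue  # ignore the update rows this handler writes itself

        emp_id = pk.split("::", 1)[1]
        last_updated = new_image["SK"]["S"]  # main item's SK is LastUpdated

        table.put_item(
            Item={
                "PK": "EMPLOYEE::UPDATE",
                "SK": f"{last_updated}::{emp_id}",
                "Employee": emp_id,
                # Epoch-seconds TTL attribute; DynamoDB removes the row at some
                # point after this time passes.
                "expires_at": int(time.time()) + 24 * 60 * 60,
            }
        )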

A scan is an absolute no-no on a table that size, so I would definitely recommend against it. If the table grows by 1,000 rows per day as you say, then before long the scan would become unmanageable and would not scale. Even at 100,000 rows a scan is very wasteful.

You could also utilise DynamoDB Streams to stream your data out to stores better suited to analytics, which is what I assume you're ultimately trying to achieve here. For example, you could stream the data to Redshift, RDS, etc. Those options require a few extra steps and could benefit from Kinesis depending on the scale of updates, but they're something else to consider.

Ultimately there are quite a lot of options here. I'd start by investigating the denormalisation and then look at the other options. If you're trying to do analytics in DynamoDB itself, I would advise against it.

PS: I nearly always name my partition and sort key attributes PK and SK and store them as strings, so I can easily add different types of data or denormalisations to the same table.

0 votes

Definitely stay away from scan...

I'd look at a GSI with

PK: YYYY-MM-DD-HH
SK: MM-SS.mmmmmm

Now, to get the records updated in the last two hours, you need only make three queries: one per hour bucket that the two-hour window touches (the partial oldest hour, the full middle hour, and the partial current hour).
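Something like this boto3 sketch, assuming each item carries the hour bucket and the minute/second fraction as separate attributes the GSI is keyed on (all names here are placeholders):

# Sketch of the three hour-bucket queries. Table, GSI and attribute names are
# placeholders; assumes UpdatedHour = 'YYYY-MM-DD-HH' and UpdatedMinSec = 'MM-SS.mmmmmm'.
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Employees")  # placeholder table name

now = datetime.now(timezone.utc)
window_start = now - timedelta(hours=2)

items = []
# A two-hour window touches at most three hour buckets: the partial oldest
# hour, the full middle hour, and the partial current hour.
for hours_back in (2, 1, 0):
    bucket = now - timedelta(hours=hours_back)
    condition = Key("UpdatedHour").eq(bucket.strftime("%Y-%m-%d-%H"))
    if hours_back == 2:
        # Oldest bucket: only records at or after the start of the window.
        condition &= Key("UpdatedMinSec").gte(window_start.strftime("%M-%S.%f"))
    elif hours_back == 0:
        # Newest bucket: only records up to "now".
        condition &= Key("UpdatedMinSec").lte(now.strftime("%M-%S.%f"))
    response = table.query(
        IndexName="UpdatedHour-UpdatedMinSec-index",  # placeholder GSI name
        KeyConditionExpression=condition,
    )
    items.extend(response["Items"])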