0
votes

I'm learning to use DynamoDB table and storing some job postings with info like date posted, company, and job title.

The query I use most is get all job posting greater than x date.

What partition key should I use so that I can do the above query without using a scan?

Partition key can only be checked for equality so using date as the partition key is no good. Date as the sort key seems best since I can query using equality on that.

However I'm a bit stuck on what is a good partition key to use then. If I put company or job title, I would have to include that as part of my query but I want ALL job postings after a certain date not just for specific company or job.

One way I thought of was using month as a partition key and date as the sort key. That way to get say last 14 days I know I need to hit the partition key of this month and maybe the last month. Then I can use the sort key to just keep the records within the last 14 days. This seems hackish tho.

1

1 Answers

1
votes

I would probably do something similar to what you mentioned in the last paragraph - keep a sub-part of the date as the partition key. Either use something like the month, or the first N digits of the unix timestamp, or something similar.

Note that, depending on how large partitions you choose you may still need to perform multiple queries when querying for, say, the last 14 days' of posts due to crossing partition boundaries (when querying for the last 14 days on January 4 you would want to query also for December of the previous year etc), but it should still be usable.

Remember that it's important to choose the partition key so that items are as evenly distributed as possible, so any hacks involving a lot of (or, as is sometimes seen in questions on SO: ALL!) items sharing the same partition key to simplify sorting is not a good idea.

Perhaps you might also want to have a look at Time-to-live to have AWS automatically delete items after a certain amount of time. This way, you could keep one table of the newest items, and "archive" all other items which are not frequently queried. Of course you could also do something similar manually by keeping separate tables for new and archived posts, but TTL is pretty neat for auto-expirying items. Querying for all new posts would then simply be a full scan of the table with the new posts.