8 votes

I've included links to other answers alongside our approaches; these seem to be the best solutions available on the web right now.

Our records need to be categorized (e.g. "horror", "thriller", "tv") and randomly accessible both within specific categories and across all/some categories. We generally need to access about 20-100 items at a time. We also have a smallish number of categories (fewer than 100).

We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.

We have tried two different approaches, with two different data structures.

Approach 1

AWS DynamoDB - Pick a record/item randomly?

Help selecting nth record in query.

In short: use the category as the hash key and a UUID as the sort key. Generate a random UUID, query Dynamo with greater-than or less-than on that UUID, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around.)
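To make the mechanics concrete, here's a minimal sketch of that query-and-retry logic. It simulates the sorted UUID range of a partition with an in-memory list (`bisect` standing in for a DynamoDB Query with a `>=` key condition and Limit 1); the real call and table are assumed, not shown.

```python
import bisect
import random
import uuid

def random_item(sorted_uuids):
    """Pick a random item by generating a UUID and taking the first stored
    UUID at or after it -- like a Query with `sort_key >= :r`, Limit 1.
    Falls back to the other direction when the random UUID overshoots."""
    r = str(uuid.uuid4())
    i = bisect.bisect_left(sorted_uuids, r)
    if i < len(sorted_uuids):
        return sorted_uuids[i]  # first query succeeded
    # Retry case: the random UUID was greater than every stored UUID,
    # so query in the other direction (`<=`) and take the last item.
    return sorted_uuids[bisect.bisect_right(sorted_uuids, r) - 1]

horror = sorted(str(uuid.uuid4()) for _ in range(50))
print(random_item(horror) in horror)  # True
```

Note the bias this pattern carries: items following large gaps in the UUID space are picked more often, which is usually acceptable when keys are uniformly random UUIDs.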

Issues with this approach:

  • The first query can come back empty if the random UUID is greater than (or less than) every UUID in the partition
  • Querying on any specific category will cause throttling at scale (small number of partitions)

We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.

AWS Database Blog Choosing the Right DynamoDB Partition Key
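The suffix idea from that post can be sketched in a few lines: spread each category across N suffixed partition keys at write time, and pick one shard at random at read time. The shard count and key format here are assumptions for illustration, not something prescribed by the blog post.

```python
import random

NUM_SUFFIXES = 10  # assumed shard count; tune to your read/write volume

def write_key(category: str) -> str:
    """Partition key used at write time: spread items across N shards."""
    return f"{category}#{random.randrange(NUM_SUFFIXES)}"

def read_key(category: str) -> str:
    """Partition key to query at read time: pick one shard at random."""
    return f"{category}#{random.randrange(NUM_SUFFIXES)}"

print(write_key("horror"))  # e.g. "horror#3"
```

Each query then targets one of ten partitions instead of a single hot one, at the cost of managing the suffix scheme everywhere the key is built.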

Approach 2

Amazon Web Services: How do we get random item from the dynamoDb's table?

Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.

By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
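A small sketch of that read path, with a dict standing in for the table (each lookup would be a `GetItem` on the real table) and a separate counts structure standing in for the secondary data structure mentioned below. The key format and field names are illustrative assumptions.

```python
import random

# Simulated table: hash key is "<category>-<zero-padded sequence>".
table = {f"horror-{n:06d}": {"title": f"movie {n}"} for n in range(1, 251)}
counts = {"horror": 250}  # per-category item counts (the extra structure)

def random_items(category, k):
    """Pick k distinct random items in a category, knowing only its count."""
    picks = random.sample(range(1, counts[category] + 1), k)
    return [table[f"{category}-{n:06d}"] for n in picks]

print(len(random_items("horror", 20)))  # 20
```

Because every key is its own partition, reads spread evenly and there is no retry case; the cost has moved entirely to maintaining the counters on write.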

Issues with this approach:

  • We need a secondary data structure to manage the sequential counts across each category
  • Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.

Conclusion

Both approaches solve our main use case of random queries on a category or categories, but their drawbacks are really deterring us from using them. We're leaning towards approach #1 with suffixes to solve the hot-partitioning issue, although we would need additional retry logic for failed queries.

Is there a better way of approaching this problem? We're specifically looking for solutions that scale well (no Scan) without requiring extra resources to be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a Lambda (billed for time used).

Thanks!


2 Answers

2 votes

Follow Up

After more research and testing, my team has decided to move to MySQL hosted on RDS for these tables. We learned that this is one of the few use cases where DynamoDB does not fit, and it would require rewriting our use case to fit the DB (bad).

We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.

For anyone wondering why we chose MySQL: it was largely due to the Node.js library available, great online resources (which DynamoDB definitely lacks), easy integration with our Lambdas via RDS, and the option to migrate to Amazon's Aurora database.

We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
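For small per-category batches like ours, random sampling in SQL can be as simple as an `ORDER BY RAND() LIMIT n` query. The sketch below uses an in-memory SQLite database as a stand-in (SQLite spells the function `RANDOM()` where MySQL uses `RAND()`); the schema is an assumption for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the RDS table. ORDER BY RANDOM() sorts
# the matching rows by a random value, which is fine for pulling 20-100
# rows per category at modest table sizes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO items (category, title) VALUES (?, ?)",
    [("horror", f"movie {n}") for n in range(500)],
)
rows = conn.execute(
    "SELECT title FROM items WHERE category = ? ORDER BY RANDOM() LIMIT 20",
    ("horror",),
).fetchall()
print(len(rows))  # 20
```

`ORDER BY RAND()` does scan every matching row, so it won't scale to huge categories, but it removes all the key-design gymnastics DynamoDB required.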

If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!

1 vote

This was too long for a comment, and I guess it's pretty much a full fledged answer now.

Approach 2

I've found that my typical time to get a single item from DynamoDB to a host in the same region is <10 ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.

If you use a keys-only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the single largest-numbered item within a category.

When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.

When you delete, find the item with the largest sequence number for that category from the GSI, copy it over the item you are deleting, and then delete the now-duplicated item from its position at the highest sequence number.

To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
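The add/delete/random-get bookkeeping above can be sketched with a dict standing in for the table and a helper standing in for the keys-only GSI lookup. All names here are illustrative; in DynamoDB the `max_seq` helper would be a Query on the GSI with `ScanIndexForward=False` and `Limit=1`.

```python
import random

table = {}  # hash key "<category>-<n>" -> item; stands in for the table

def max_seq(category):
    """Highest sequence number in a category (the GSI query)."""
    seqs = [int(k.rsplit("-", 1)[1])
            for k in table if k.startswith(category + "-")]
    return max(seqs, default=0)

def add(category, item):
    table[f"{category}-{max_seq(category) + 1}"] = item

def delete(category, n):
    """Copy the highest-numbered item over slot n, then drop the top slot,
    keeping the sequence numbers contiguous."""
    top = max_seq(category)
    if n != top:
        table[f"{category}-{n}"] = table[f"{category}-{top}"]
    del table[f"{category}-{top}"]

def random_get(category):
    return table[f"{category}-{random.randint(1, max_seq(category))}"]

for t in ["a", "b", "c", "d"]:
    add("horror", {"title": t})
delete("horror", 2)  # "b"'s slot is overwritten by "d", top slot removed
print(sorted(i["title"] for i in table.values()))  # ['a', 'c', 'd']
```

The swap-on-delete keeps every category's sequence numbers dense, which is what makes the random pick a single `randint` plus one `GetItem`.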

Approach 1

I'm not sure exactly what you mean by "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make approach 1 work by putting a DAX cluster in front of your DynamoDB table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)