1 vote

I am modelling the data of my application to use DynamoDB. My data model is rather simple:

  • I have users and projects
  • Each user can have multiple projects

There can be millions of users, and each user can have thousands of projects.

My access pattern is also rather simple:

  1. Get a user by id
  2. Get a list of paginated users sorted by name or creation date
  3. Get a project by id
  4. Get projects by user, sorted by date

My single table for this data model is the following:

[Image: single-table design for users and projects, with PK/SK and GSI attributes]

I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2. According to the documentation and best practices, to get a sorted list of paginated users:

  • I can't use a scan, as sorting is not supported
  • I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
  • I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above

I could bucket users and use write sharding, but I don't see how I could practically query paginated, sorted users: the bucket PKs would need to be more or less random, so I would have to query every bucket and merge the results to sort all users together. I also thought that the bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
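For what it's worth, the scatter-gather read over sharded buckets is mechanically workable: query every shard (each shard's items come back already sorted by the sort key) and merge the sorted streams, keeping only one page. A rough sketch with in-memory stand-ins for the per-shard Query results; the shard contents and page size here are made up, not from the actual table:

```python
import heapq

# Hypothetical per-shard query results. Each shard's items are already
# sorted by "name", because name is the sort key within the shard.
# Against DynamoDB this would be N parallel Query calls, one per shard
# PK (e.g. PK = "USERS#0" .. "USERS#2"), each returning items in SK order.
shard_results = [
    [{"name": "alice"}, {"name": "dave"}],
    [{"name": "bob"}, {"name": "erin"}],
    [{"name": "carol"}, {"name": "frank"}],
]

def merged_page(shards, page_size):
    """Scatter-gather: merge the sorted per-shard streams, take one page."""
    merged = heapq.merge(*shards, key=lambda item: item["name"])
    return [item for _, item in zip(range(page_size), merged)]

# First page of 4 users, globally sorted by name across all shards.
page = merged_page(shard_results, 4)
```

The downside is exactly the one described above: every page read fans out to every shard, so read cost grows with the shard count even though each shard only contributes a few items per page.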

My application model is rather simple. However, after reading all the docs and best practices and watching many online videos, I find myself stuck on the most basic use case, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.

What would others do in this case? I would really like to use DynamoDB for all the benefits it offers, especially in terms of cost.

Edit

Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all. As to the sizing, it needs to scale well, at least to the tens of thousands.

I'm not sure about the use case for 2), because an Admin-Panel for an app with millions of users that paginates through these seems impractical :-) - maybe there is another way to phrase your requirement that can be implemented more easily while still being sufficient for what you want to do. – Maurice

We need two GSIs, one by_name with PK as 'name', the second by_date with PK as created_date, and use the query operation to paginate. – Balu Vyamajala

@Maurice, it does not need to be necessarily an admin panel; in my app the main use case is something like this: stackoverflow.com/users?tab=Reputation&filter=all. As to the sizing, it needs to scale well, at least to the tens of thousands. – Lorenzo Polidori

@BaluVyamajala, how would this satisfy access pattern 2)? To use the GSIs you are suggesting, the application would need to know the name and creation date of a user (as they are the PKs) and query by them. Did you mean SK? In which case, which PK would you use? – Lorenzo Polidori

1 Answer

2 votes

I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.

I think this sounds like a reasonable approach.

The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.

[Chart: distribution of US baby names by first letter]

Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.

While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.

1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
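If you go with name-prefix buckets, the partition key can be derived deterministically at write time. A minimal sketch of that idea; the `USERS#…` key format, the catch-all bucket, and the `prefix_len` knob are my own assumptions, not something from the question:

```python
def name_bucket(name: str, prefix_len: int = 1) -> str:
    """Derive a GSI partition key from the first letter(s) of a user's name.

    Names that don't start with a letter fall into a catch-all bucket so
    every item still gets a valid partition key.
    """
    prefix = name.strip().upper()[:prefix_len]
    return f"USERS#{prefix}" if prefix.isalpha() else "USERS#OTHER"

# Bucketing a small sample shows how users would spread across partitions.
names = ["Alice", "Ann", "Bob", "Carol", "alice", "42cats"]
buckets: dict[str, list[str]] = {}
for n in names:
    buckets.setdefault(name_bucket(n), []).append(n)
```

A nice property of prefix buckets over random shards is that a sorted page starting at a given name only needs to query the bucket(s) whose prefix range overlaps that page, rather than all of them; and if a single letter like "A" runs hot, raising `prefix_len` to 2 or 3 splits it into smaller partitions without changing the scheme.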

You might also consider a tool like Elasticsearch, which could support your second access pattern and more.