4
votes

I am new to the NoSQL world. I am building a serverless app with DynamoDB. In a relational DB, with three entities like post, post_likes and post_tags, I would have a few tables and use joins to fetch data. But I wonder how one should structure a NoSQL database for a scenario where a post has a one-to-many relationship with likes and a many-to-many relationship with tags.

Post model:

user_id <string>
attachment_url <string>
description <string>
public <boolean>

Like model:

user_id <string>
post_id <string>
type <string>

Tag model:

name <string>

I have a few access patterns:

  1. Get all public posts
  2. Get all posts filtered by a single tag and public status
  3. Get all posts by user id
  4. Get a single post by post id

Each time, a post should be fetched with its tags data and its likes data, including the user data attached to each like. In a relational DB I would create a post_tags table and fetch all posts by tag. But how can I do this with DynamoDB?

I am struggling to figure out what my table should look like and what to set as the partition and sort keys among post_id, user_id, tag_name or public for this case.

My initial thought was to build a table with an entity that would look like this:

Partition key | Sort key | data attributes 
tag_name      | post_id  | public | user_id | likes[] | other post attributes...

Then this table would look something like this:

[Image]

I have set up two global secondary indexes. First global secondary index:

partition key set to public and sort key to post_id

Second global secondary index:

partition key set to user_id and sort key to post_id

That way, for each tag a post has, I would have a duplicate of that post in the table. I thought that by having the tag as the first filter, I could efficiently query posts by tag.

But if I query by just the public status or user_id, I would get all the duplicates of each post, one for every tag it belongs to.

Or should I have three separate entities in the table (tags, posts and likes)? Then, to fetch posts by a tag, I would first run one query to find all post_ids for that tag, then a second query to fetch the posts and their like IDs, and then a third query to fetch the likes array. I don't know what the best practice is for this kind of thing, since I have only just started using DynamoDB.

What should this DB structure look like, then?

1
What primary keys have you tried so far? I may be mistaken, but your question about indexes implies that you are trying to create SQL-like indexes in DynamoDB. DynamoDB does have a concept of "secondary indexes", but it has no relationship with indexes found in SQL databases. – Seth Geoghegan
I haven't done anything yet. Maybe I wasn't clear in the question: I just wasn't sure what to set as the hash and sort key, or generally how to structure the database for this case. – Leff
Your 4th access pattern is "get a single post". Are you getting a single post by a post ID, a User ID, both or something else? You also mention modeling a Like entity, but don't describe using it in any of your access patterns. How do you plan on using the Like information? For example, does a Post have a like count? Do you need to track which user liked which post? Can you elaborate on how your application uses this info? Do you need additional access patterns like "Fetch likes per post" or "Fetch liked posts for a user"? – Seth Geoghegan
Also, can you elaborate on tag usage? Do you want to be able to fetch a post by a single tag or an arbitrary number of tags? Tagging can be a hard problem to solve in DynamoDB, and might be best solved outside of DDB. Knowing exactly how you plan on using tags can help determine if it's the right fit for DDB. For example, listing the tags on a specific Post is straightforward. Fetching all Posts with a single tag is also straightforward. Fetching all Posts tagged with an arbitrary list of tags is harder. – Seth Geoghegan
For the single post I was thinking of fetching it by post ID; that would be enough. For the Likes, a Post would be fetched with the array of Likes associated with it, and each like would have user information. As for the tags, I was thinking of a single tag, and fetching all the posts that have that tag. – Leff

1 Answer

2
votes

You're off to a great start by thinking deeply about your access patterns and defining your entities (Posts, Users, Likes, etc). As you know, having a thorough understanding of your access patterns is critical to storing your data in DynamoDB.

While reviewing my answer, keep in mind that this is only one solution. DynamoDB gives you a ton of flexibility when defining your data model, which can be both a blessing and a curse! This answer is not meant to be the way to model these access patterns. Instead, it's one way that these access patterns can be implemented. Let's get into it!

I like to start by listing the entities we need to model, as well as the Primary key for each. Throughout this post, I'll be using composite primary keys, which are keys made up of a Partition Key (PK) and a Sort Key (SK). Let's start out with a blank table and fill it out as we go.

         Partition Key             Sort Key
User
Post
Tag

Users

Users are central to your application, so I'll start there.

Let's start by defining a User model that lets us identify a User by ID. I'll use the pattern USER#<user_id> for the PK and SK of the User entity.

[Image: User entities]

This supports the following access patterns (examples in pseudocode for simplicity):

  1. Fetch User by ID
ddbClient.query(PK = USER#1, SK = USER#1)

I'll update the table with the new PK/SK pattern for Users

         Partition Key             Sort Key
User     USER#<user_id>           USER#<user_id>
Post
Tag

Posts

I'll start modeling Posts by focusing on the one-to-many relationship between Users and their Posts.

You have an access pattern to fetch All Posts by UserId, so I'll start by adding the Post model to the User partition. I'll do this by defining a PK of USER#<user_id> and an SK of POST#<post_id>.

[Image: Users and Posts]

This supports the following access patterns:

  1. Fetch User and all Posts
ddbClient.query(PK = USER#<user_id>)
  2. Fetch User Posts
ddbClient.query(PK = USER#<user_id>, SK begins_with "POST#")
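
As a rough illustration, the second pattern could translate into parameters for the low-level boto3 client's query call along these lines. The table name app-table and the attribute names PK and SK are assumptions made for this sketch; the function only builds the request dictionary, so it runs without AWS access.

```python
def user_posts_query(user_id, table_name="app-table"):
    """Build keyword arguments for a DynamoDB Query that returns one user's Posts.

    "app-table", "PK" and "SK" are placeholder names, not taken from a real table.
    """
    return {
        "TableName": table_name,
        # Exact match on the partition key, prefix match on the sort key.
        "KeyConditionExpression": "#pk = :pk AND begins_with(#sk, :prefix)",
        "ExpressionAttributeNames": {"#pk": "PK", "#sk": "SK"},
        "ExpressionAttributeValues": {
            ":pk": {"S": f"USER#{user_id}"},
            ":prefix": {"S": "POST#"},
        },
        # Return items in descending sort-key order.
        "ScanIndexForward": False,
    }


params = user_posts_query("1")
```

With credentials configured, these parameters could be passed as boto3.client("dynamodb").query(**params).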

You may wonder about the odd-looking Post IDs. When fetching Posts, you'll probably want to get the most recent Posts first. You also want to be able to uniquely identify Posts by ID. When you have this sort of requirement, you can use a KSUID as your unique identifier. Explaining KSUIDs is a bit out of scope for your question, but know that they are unique and sortable by the time they were created. Since DynamoDB sorts results by the Sort Key, your query for a user's posts will automatically be sorted by creation date!
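
To make the idea concrete, here is a minimal sketch of a KSUID-style identifier: a 4-byte timestamp followed by 16 random bytes, base62-encoded to a fixed 27 characters. It is an illustration of why such IDs sort by creation time, not a replacement for a maintained KSUID library, and the function name new_sortable_id is made up for this example.

```python
import os
import time

# Digits, then uppercase, then lowercase: the same order as their ASCII codes,
# so fixed-width strings compare the same way as the numbers they encode.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
KSUID_EPOCH = 1_400_000_000  # custom epoch (seconds) used by the KSUID format


def new_sortable_id(now=None):
    """Return a 27-character ID whose string order follows creation time."""
    seconds = int(time.time() if now is None else now) - KSUID_EPOCH
    # Timestamp goes first so it occupies the most significant bytes.
    raw = seconds.to_bytes(4, "big") + os.urandom(16)
    number = int.from_bytes(raw, "big")
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    # Left-pad to a fixed width so shorter encodings still sort correctly.
    return "".join(reversed(chars)).rjust(27, "0")


older = new_sortable_id(now=1_600_000_000)
newer = new_sortable_id(now=1_600_000_001)
```

IDs created in different seconds compare in creation order; IDs created within the same second are ordered only by their random part.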

Updating the PK/SK patterns for your application, we now have

         Partition Key             Sort Key
User     USER#<user_id>           USER#<user_id>
Post     USER#<user_id>           POST#<post_id>
Tag

Tags

We have a few options for modeling the many-to-many relationship between Posts and Tags. You could include a list attribute on your Post item that simply lists the tags on the item. This approach is perfectly fine. However, looking at your other access patterns, I'm going to take a different approach for now (it will be apparent why later).

I will model tags with a PK of POST#<post_id> and an SK of TAG#<tag_name>

[Image: Post Tags]

Since Primary Keys are unique, modeling tags in this way will ensure that no Post is tagged with the same Tag twice. Additionally, it allows us to have an unbounded number of Tags on a Post.

Updating our PK/SK table for Tag, we have

         Partition Key             Sort Key
User     USER#<user_id>           USER#<user_id>
Post     USER#<user_id>           POST#<post_id>
Tag      POST#<post_id>           TAG#<tag_name>
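
The key patterns in this table can be sketched as plain Python dictionaries. The snippet below uses an in-memory dict keyed by (PK, SK) as a stand-in for the table, purely to show how the composite key keeps a Post from carrying the same Tag twice. The attribute names PK and SK and the sample values are assumptions for illustration.

```python
def user_item(user_id, **attrs):
    return {"PK": f"USER#{user_id}", "SK": f"USER#{user_id}", **attrs}


def post_item(user_id, post_id, **attrs):
    return {"PK": f"USER#{user_id}", "SK": f"POST#{post_id}", **attrs}


def tag_item(post_id, tag_name):
    return {"PK": f"POST#{post_id}", "SK": f"TAG#{tag_name}"}


def put(table, item):
    """Mimic PutItem: an item with an existing (PK, SK) replaces the old one."""
    table[(item["PK"], item["SK"])] = item


table = {}  # in-memory stand-in for the DynamoDB table
put(table, user_item("1", name="example user"))
put(table, post_item("1", "p1", description="first post", public=True))
put(table, tag_item("p1", "aws"))
put(table, tag_item("p1", "aws"))       # same key, so no duplicate tag appears
put(table, tag_item("p1", "dynamodb"))

tags_on_p1 = sorted(sk for (pk, sk) in table if pk == "POST#p1")
```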

At this point we've modeled Users, Posts and Tags. However, we've only addressed one of your four access patterns. Let's see how we can use secondary indexes to support the rest.

Note: You could also model Likes in the exact same way.

Defining A Secondary Index

Secondary indexes allow you to support additional access patterns within your data. Let's define a very simple secondary index and see how it supports your various access patterns.

I'm going to create a secondary index that swaps the PK/SK patterns in your base table. This pattern is called an inverted index, and would look like this:

[Image: Inverted Secondary Index]

All we've done here is swap the PK/SK pattern of your base table, which gives us two additional access patterns:

  1. Fetch Post by ID
ddbClient.query(IndexName = InvertedIndex, PK = POST#<post_id>)
  2. Fetch Posts by Tag
ddbClient.query(IndexName = InvertedIndex, PK = TAG#<tag_name>)
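
The effect of inverting the keys can be simulated in a few lines of Python. This is an in-memory sketch, not DynamoDB itself: the query helper and the sample items are invented for illustration, and querying on the SK attribute stands in for querying the inverted index, whose partition key is the base table's sort key.

```python
def query(items, key_attr, key_value):
    """Return the items whose partitioning attribute equals key_value."""
    return [item for item in items if item[key_attr] == key_value]


base_table = [
    {"PK": "USER#1", "SK": "USER#1"},
    {"PK": "USER#1", "SK": "POST#p1"},
    {"PK": "USER#1", "SK": "POST#p2"},
    {"PK": "POST#p1", "SK": "TAG#aws"},
    {"PK": "POST#p2", "SK": "TAG#aws"},
    {"PK": "POST#p2", "SK": "TAG#nosql"},
]

# Base table: items are grouped by PK.
user_partition = query(base_table, "PK", "USER#1")

# Inverted index: the same items, grouped by SK instead.
post_by_id = query(base_table, "SK", "POST#p1")
tagged_aws = query(base_table, "SK", "TAG#aws")

# Each tag item keeps the post ID in its PK, which is the index's sort key.
post_ids = [item["PK"].split("#", 1)[1] for item in tagged_aws]
```

Note that the tag query returns the Tag items rather than the full Posts. Under this sketch's assumptions you would either copy the Post attributes you need onto each Tag item or fetch the Posts by ID afterwards.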

Fetch All Posts by Public/Private status

You wanted to fetch posts by public/private status, as well as fetching all Posts. One way to fetch all Posts is to put them in a single partition. We can put the public/private status in the sort key to separate the public and private Posts.

To do this, I'll create two new attributes on the Post item: _type and publicPostId. These fields will serve as the PK/SK patterns for the secondary index I'm calling PostByStatus.

After doing this, your base table would look like this:

[Image: new Post attributes]

and your new secondary index would look like this

[Image: Posts by public status]

This secondary index would enable the following access patterns

  1. Fetch All Posts
ddbClient.query(IndexName = PostByStatus, PK = POST)
  2. Fetch All Private Posts
ddbClient.query(IndexName = PostByStatus, PK = POST, SK begins_with "PRIVATE#")
  3. Fetch All Public Posts
ddbClient.query(IndexName = PostByStatus, PK = POST, SK begins_with "PUBLIC#")
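
Here is a small in-memory sketch of how the two extra attributes could be derived and queried. The value format PUBLIC#<post_id> / PRIVATE#<post_id> is inferred from the begins_with queries above, and the helper names are hypothetical; the list-based query only imitates what the PostByStatus index would return.

```python
def with_status_keys(post, post_id):
    """Add the attributes that act as the PK/SK of the PostByStatus index."""
    status = "PUBLIC" if post["public"] else "PRIVATE"
    return {**post, "_type": "POST", "publicPostId": f"{status}#{post_id}"}


def query_by_status(items, prefix=""):
    """Imitate Query on PostByStatus: PK = POST, SK begins_with prefix, sorted by SK."""
    matches = [
        item for item in items
        if item.get("_type") == "POST" and item["publicPostId"].startswith(prefix)
    ]
    return sorted(matches, key=lambda item: item["publicPostId"])


posts = [
    with_status_keys({"PK": "USER#1", "SK": "POST#p2", "public": False}, "p2"),
    with_status_keys({"PK": "USER#1", "SK": "POST#p1", "public": True}, "p1"),
    with_status_keys({"PK": "USER#2", "SK": "POST#p3", "public": True}, "p3"),
]

public_posts = query_by_status(posts, "PUBLIC#")
private_posts = query_by_status(posts, "PRIVATE#")
all_posts = query_by_status(posts)
```

Because the status prefix comes first in the sort key, the unfiltered query returns private Posts before public ones, with each group ordered by post ID.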

Remember, Post IDs are KSUIDs, so within each status prefix the results will naturally be sorted by the date the Post was made.

A Word on Hot Partitions

Storing all your Posts in a single partition will likely result in a hot partition as your application scales. One way to address this is by distributing your Post items across multiple partitions. How you do that is entirely up to you and specific to your application.

One strategy to avoid the single POST partition could involve grouping Posts by creation day/week/month/etc. For example, instead of using POST as your PK in the PostByStatus secondary index, you could use POSTS#<month>-<year> instead, which would look like this:

[Image: Avoiding hot partitions]

Your application would need to take this pattern into account when fetching Posts (e.g. start at the current month and go backwards until enough results are fetched), but you'd be spreading the load across multiple partitions.
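
The "start at the current month and go backwards" loop might look roughly like this. The zero-padded POSTS#MM-YYYY key format, the 12-month cap, and the query_partition callback (a stand-in for one Query against the index) are all assumptions made for the sketch.

```python
from datetime import date


def month_partitions(start, count):
    """Yield POSTS#<month>-<year> keys, beginning at `start` and moving backwards."""
    year, month = start.year, start.month
    for _ in range(count):
        yield f"POSTS#{month:02d}-{year}"
        month -= 1
        if month == 0:
            year, month = year - 1, 12


def fetch_recent(query_partition, start, wanted, max_months=12):
    """Query month partitions newest-first until `wanted` posts are collected."""
    results = []
    for key in month_partitions(start, max_months):
        results.extend(query_partition(key))
        if len(results) >= wanted:
            break
    return results[:wanted]


# Fake data standing in for the index; each list is already newest-first.
fake_index = {
    "POSTS#02-2021": ["p5"],
    "POSTS#12-2020": ["p4", "p3"],
    "POSTS#11-2020": ["p2", "p1"],
}
recent = fetch_recent(lambda key: fake_index.get(key, []), date(2021, 2, 15), wanted=3)
```

The cap on months keeps the loop from walking back indefinitely when fewer Posts exist than were requested.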

Wrapping Up

I hope this exercise gives you some ideas on how to model your data to support specific access patterns. Data modeling in DynamoDB takes time to get right, and will likely require multiple iterations to make work for your specific application. It can be a steep learning curve, but the payoff is a solution that brings scale and speed to your application.