
I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:

{ 
  userId: "08074c7e0c0a4453b3c723685021d0b6",  // partition key
  email: "[email protected]",
  ... other attributes ...
}

When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
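For reference, the single-email lookup described above might look roughly like the sketch below. It only builds the parameters for a DynamoDB `Query` against the index and does not call AWS. The table name `Users` and index name `email-index` are placeholders I made up, not names from the actual app.

```python
def build_email_query(email, table_name="Users", index_name="email-index"):
    """Build low-level Query parameters for looking up a user by email.

    `table_name` and `index_name` are hypothetical placeholders.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,  # GSI with email as its partition key
        "KeyConditionExpression": "#email = :email",
        # Alias the attribute name so it can never clash with a reserved word.
        "ExpressionAttributeNames": {"#email": "email"},
        "ExpressionAttributeValues": {":email": {"S": email}},
    }


params = build_email_query("someone@example.com")
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").query(**params)
```

A key condition only accepts a single equality test on the partition key, which is why a list of emails stored in one attribute can't be queried this way.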

But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.

Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?

  1. Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.

      {
        userId: "08074c7e0c0a4453b3c723685021d0b6",  // partition key
        itemTypeAndIndex: "main",  // sort key
        email: "[email protected]",
        ... other attributes ...
      }

  2. If the user adds a second, third, etc. email, then add a new item for each email, like this:

      {
        userId: "08074c7e0c0a4453b3c723685021d0b6",  // partition key
        itemTypeAndIndex: "Email-2",  // sort key
        email: "[email protected]"
        // no more attributes
      }

  3. The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.

  4. If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)

  5. If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.

  6. The same approach (new items with same userId but different sort keys) could be used for other 1-user-has-many-values data that needs to be Query-able.
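To make the swap and delete steps above concrete, here is a rough sketch that builds the request parameters for `TransactWriteItems` and `BatchWriteItem` without calling AWS. The table name `Users` and the helper names are my own placeholders; the key attributes `userId` and `itemTypeAndIndex` come from the plan above.

```python
TABLE = "Users"  # hypothetical table name


def key(user_id, sort_key):
    """Primary key of one item: userId (partition) + itemTypeAndIndex (sort)."""
    return {"userId": {"S": user_id}, "itemTypeAndIndex": {"S": sort_key}}


def build_swap_primary_email(user_id, secondary_sort_key, old_primary, new_primary):
    """TransactWriteItems parameters that swap the emails of two items.

    Each update is conditional on the item still holding the email we expect,
    so the whole transaction fails rather than clobbering a concurrent change.
    """
    def update(sort_key, expected, replacement):
        return {
            "Update": {
                "TableName": TABLE,
                "Key": key(user_id, sort_key),
                "UpdateExpression": "SET #email = :new",
                "ConditionExpression": "#email = :old",
                "ExpressionAttributeNames": {"#email": "email"},
                "ExpressionAttributeValues": {
                    ":new": {"S": replacement},
                    ":old": {"S": expected},
                },
            }
        }

    return {
        "TransactItems": [
            update("main", old_primary, new_primary),
            update(secondary_sort_key, new_primary, old_primary),
        ]
    }


def build_delete_batches(user_id, sort_keys, batch_size=25):
    """BatchWriteItem parameters that delete every item of one user.

    BatchWriteItem accepts at most 25 requests per call, hence the chunking.
    The sort keys would come from a Query on userId.
    """
    requests = [{"DeleteRequest": {"Key": key(user_id, sk)}} for sk in sort_keys]
    return [
        {"RequestItems": {TABLE: requests[i:i + batch_size]}}
        for i in range(0, len(requests), batch_size)
    ]
```

With boto3, each dictionary could be passed as keyword arguments to `transact_write_items` or `batch_write_item` on a DynamoDB client. Note that a batch write is not atomic and can return unprocessed items that need a retry, so deleting a user is weaker than the transactional swap.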

Is this a good way to do it? Is there a better way?

Justin, for searching on attributes I would strongly advise against using DynamoDB. I'm not saying you can't achieve this, but I see a few problems you will eventually run into if you go this route. - mango
@mango - what problems do you foresee? - Justin Grant
I accidentally wrote my answer as a comment. I have added a detailed answer in the answer section. I hope that helps. - mango

1 Answer


Justin, for searching on attributes I would strongly advise against using DynamoDB. I'm not saying you can't achieve this with it, but I see a few problems you will eventually run into if you go this route.

  1. Using a sort key for email IDs will create duplicate records for the same user, i.e. if a user has registered 5 emails, that implies 5 records in your table with the same schema and attributes except for the email attribute.
  2. What if a new use case comes up in the future where you also want to search for a user by some other attribute (for example, cell phone number, assuming a user may have more than one)?
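  3. DynamoDB limits the number of secondary indexes you can create for a table (5 local secondary indexes and, by default, 20 global secondary indexes), so you cannot keep adding an index for every new search attribute.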

Thus, as the number of search criteria grows, this solution can easily become a bottleneck, and your system may not scale well.


To the best of my knowledge, there are a few options you can choose from, depending on your requirements and budget, that address this problem using a combination of databases.

Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]

  1. Store the user records in a DynamoDB table (let's call it UserTable) as and when a user registers.
  2. Enable DynamoDB Streams on UserTable.
  3. Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch.
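As a rough illustration of step 3, the sketch below turns a DynamoDB Streams event into a list of index/delete actions. It assumes the stream is configured to include new images (otherwise `NewImage` is absent), and it stops short of actually writing to Elasticsearch. The function names and the `|`-joined document ID are my own choices, not part of any AWS API.

```python
def from_attribute_value(av):
    """Convert one DynamoDB-typed value such as {"S": "x"} to plain Python."""
    (type_tag, value), = av.items()
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "N":  # numbers arrive as strings
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_attribute_value(v) for v in value]
    if type_tag == "M":
        return {k: from_attribute_value(v) for k, v in value.items()}
    if type_tag == "SS":
        return list(value)
    raise ValueError("unsupported attribute type: " + type_tag)


def stream_records_to_actions(event):
    """Map stream records to ("index" | "delete", doc_id, document) tuples."""
    actions = []
    for record in event.get("Records", []):
        data = record["dynamodb"]
        keys = data["Keys"]
        # Join the key values so each table item maps to one search document.
        doc_id = "|".join(str(from_attribute_value(keys[n])) for n in sorted(keys))
        if record["eventName"] == "REMOVE":
            actions.append(("delete", doc_id, None))
        else:  # INSERT or MODIFY
            image = data["NewImage"]
            document = {k: from_attribute_value(v) for k, v in image.items()}
            actions.append(("index", doc_id, document))
    return actions


def handler(event, context):
    """Lambda entry point; sending the actions to the search cluster is omitted."""
    actions = stream_records_to_actions(event)
    return {"processed": len(actions)}
```

Keeping the record-to-action conversion separate from the network call makes it easy to unit test the mapping with hand-written events.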

Now in your application, use DynamoDB for fetching user records by id. For all other search criteria (like searching on emailId, phone number, zip code, location etc.), fetch the records from AWS Elasticsearch. By default, AWS Elasticsearch indexes all the attributes of your record, so you can search on any field with millisecond-level latency.
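A search by email could then be expressed with a small query body like the one sketched here. It assumes the documents were indexed with Elasticsearch's default dynamic mapping, under which string fields get a `keyword` sub-field for exact matching; the helper name is hypothetical.

```python
import json


def build_exact_match_search(field, value):
    """Build a search body that matches documents whose field equals value.

    Assumes default dynamic mapping, where a string field `email` also has
    an exact-match sub-field `email.keyword`.
    """
    return {"query": {"term": {field + ".keyword": value}}}


body = build_exact_match_search("email", "someone@example.com")
print(json.dumps(body))
```

If a user document holds several emails as an array, the same `term` query matches when any element of the array equals the value.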

Option 2. Use AWS Aurora [Less preferred solution]

If your application has a relational use case, where the data are related, you may consider this option. Just to call out, Aurora is a SQL database. Since it is relational storage, you can organize the records in multiple tables and join them on the primary keys of those tables.
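As a sketch of what the relational layout could look like, the example below uses Python's built-in `sqlite3` purely as a local stand-in for Aurora (which is MySQL- or PostgreSQL-compatible). The table and column names are illustrative assumptions, not a recommended production schema.

```python
import sqlite3


def create_schema(conn):
    """One row per user, plus one row per email address owned by a user."""
    conn.executescript(
        """
        CREATE TABLE users (
            user_id TEXT PRIMARY KEY
        );
        CREATE TABLE user_emails (
            email      TEXT PRIMARY KEY,  -- an address belongs to one user
            user_id    TEXT NOT NULL REFERENCES users(user_id),
            is_primary INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def add_user(conn, user_id, emails):
    """Insert a user; the first email in the list is marked as primary."""
    conn.execute("INSERT INTO users (user_id) VALUES (?)", (user_id,))
    conn.executemany(
        "INSERT INTO user_emails (email, user_id, is_primary) VALUES (?, ?, ?)",
        [(email, user_id, int(i == 0)) for i, email in enumerate(emails)],
    )


def find_user_by_email(conn, email):
    """Return the owning user_id, or None if the address is unknown."""
    row = conn.execute(
        "SELECT u.user_id FROM users AS u "
        "JOIN user_emails AS e ON e.user_id = u.user_id "
        "WHERE e.email = ?",
        (email,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_schema(conn)
add_user(
    conn,
    "08074c7e0c0a4453b3c723685021d0b6",
    ["first@example.com", "second@example.com"],
)
```

Here the primary key on `email` plays the role of the lookup index, and it also enforces that no two users share an address.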



I would suggest the first option because:

  1. DynamoDB will provide durable, highly available, low-latency primary storage for your application.
  2. AWS Elasticsearch will act as secondary storage, which is also durable, scalable, and low latency.
  3. With AWS Elasticsearch, you can run any search query on your data. You can also do analytics on it. A Kibana UI is provided out of the box, which you can use to plot analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution by city/state/country, etc.).
  4. With DynamoDB Streams and AWS Lambda, you will be syncing these two databases in near real time [usually within a second or so, though no exact latency is guaranteed].
  5. Your application will be scalable, and the search feature can be further enhanced to filter on multi-level attributes. [One such example: search all users who belong to a given city.]

Having said that, now I will leave this up to you to decide. 😊