Amazon DynamoDB (many-to-many relationships)

Question

I am fairly new to NoSQL databases and I usually don't have an issue, but in my latest mobile app I am trying to use Amazon's DynamoDB and I have hit a wall.

Generally, I have no issues if the table contains unique, single entries. However, if the table contains a list of items associated with a specific attribute, I am not sure how to retrieve/query the data properly.

For example, below is a table. We will call it AWS_DDB_SESSIONS. It contains all the information about the user sessions (where SessionID is the hash key):

Assume that the table data is represented as follows: SessionID(hash), StartTime, StopTime, UserEmail


    TableName:  AWS_DDB_SESSIONS
    "AAADFR101", "01:10:00.000Z", "11:10:00.000Z", "[email protected]"
    "BBBLWL102", "12:20:00.000Z", "18:20:00.000Z", "[email protected]"
    "CCCUFE103", "03:00:00.000Z", "03:30:00.000Z", "[email protected]"

I can easily get a single entry if I specify the hash key value (SessionID). For example, if I wanted the first item in the list (AAADFR101), I would just make a "getItem" request with the specified SessionID:

Amazon's Java SDK PseudoCode:
  AmazonSDK.API.Table.getItem("AAADFR101");

The SQL equivalent is:  
  select * from AWS_DDB_SESSIONS where SessionID='AAADFR101'; 

Expected Results:
  "AAADFR101", "01:10:00.000Z", "11:10:00.000Z", "[email protected]"

However, if I want to get all the sessions that are associated with "[email protected]", I have no idea what to do.


    Amazon's Java SDK PseudoCode:
      // Create a Condition...
      Condition userEmailCondition = new Condition()
                .withComparisonOperator(ComparisonOperator.EQ.toString())
                .withAttributeValueList(new AttributeValue().withS("[email protected]"));  
      // Create the Map of Conditions...            
      Map<String, Condition> userEmailConditionMap = new HashMap<String, Condition>();
      userEmailConditionMap.put("UserEmail", userEmailCondition);
      QueryRequest qRequest = new QueryRequest().withTableName("AWS_DDB_SESSIONS")
                    .withKeyConditions(userEmailConditionMap);
      QueryResult qResult = AmazonSDK.API.Table.query(qRequest);
      // NOTE: It errors here because the query doesn't contain the "SessionID" 
      // value within the query(which is the primary key/hash for this table).

    The SQL equivalent is:
      select * from AWS_DDB_SESSIONS where UserEmail="[email protected]";

    Expected Results:
      "AAADFR101", "01:10:00.000Z", "11:10:00.000Z", "[email protected]"
      "CCCUFE103", "03:00:00.000Z", "03:30:00.000Z", "[email protected]"

As mentioned above, the Java pseudocode doesn't work because Amazon's DynamoDB API requires the hash key (SessionID) in the key conditions when you perform a query. Even if you changed the DynamoDB table's schema to use a hash key (SessionID) and a range key (UserEmail), it would do you no good: you would still have to supply the SessionID, so you run into the same problem.
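To make the limitation concrete, here is a minimal sketch that models the table with plain Java collections. It is not the AWS SDK: the class, the method names, and the example.com addresses are all hypothetical stand-ins. It only illustrates why a lookup by SessionID is direct, while finding sessions by UserEmail on this schema means reading every item.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a table keyed only by SessionID. This is NOT the AWS SDK;
// every class and method name here is hypothetical.
final class HashKeyOnlyTable {

    static final class Session {
        final String sessionId;
        final String startTime;
        final String stopTime;
        final String userEmail;

        Session(String sessionId, String startTime, String stopTime, String userEmail) {
            this.sessionId = sessionId;
            this.startTime = startTime;
            this.stopTime = stopTime;
            this.userEmail = userEmail;
        }
    }

    // The hash key (SessionID) is the only way to address an item directly.
    private final Map<String, Session> itemsBySessionId = new LinkedHashMap<>();

    void putItem(Session session) {
        itemsBySessionId.put(session.sessionId, session);
    }

    // Analogous to getItem: a direct lookup by hash key.
    Session getItem(String sessionId) {
        return itemsBySessionId.get(sessionId);
    }

    // Analogous to a scan with a filter: every item is read, then non-matches are dropped.
    List<Session> scanByEmail(String userEmail) {
        List<Session> matches = new ArrayList<>();
        for (Session session : itemsBySessionId.values()) {
            if (session.userEmail.equals(userEmail)) {
                matches.add(session);
            }
        }
        return matches;
    }

    // Sample rows from the question, with placeholder e-mail addresses.
    static HashKeyOnlyTable sample() {
        HashKeyOnlyTable table = new HashKeyOnlyTable();
        table.putItem(new Session("AAADFR101", "01:10:00.000Z", "11:10:00.000Z", "john.doe@example.com"));
        table.putItem(new Session("BBBLWL102", "12:20:00.000Z", "18:20:00.000Z", "jane.roe@example.com"));
        table.putItem(new Session("CCCUFE103", "03:00:00.000Z", "03:30:00.000Z", "john.doe@example.com"));
        return table;
    }
}
```

In DynamoDB terms, scanByEmail corresponds roughly to a Scan with a filter, which works without the hash key but touches the whole table, so it gets slower and more expensive as the table grows.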

I tried to come up with some alternative solutions, but I seem to come back to the same problem no matter how I approach it.

Alternative #1 The first alternative that I thought of was to add another attribute as a unique identifier. So you would change the schema:


    Original Schema:  
      SessionID(hash), StartTime, StopTime, UserEmail

    New Table Schema:
      RecordID(hash), SessionID, StartTime, StopTime, UserEmail

This creates a table similar to one that you would use in a Relational Database (RDB). However, it suffers from the original problem of not being able to query on UserEmail.

Alternative #2 After some online research, I found a suggestion to split the table up into smaller tables based on various attributes (UserEmail). So for every user you could create a table to store their session info in. Hence, our original sample would be split into tables with the following data:


    TableName:  [email protected]
    "BBBLWL102", "12:20:00.000Z", "18:20:00.000Z"


    TableName:  [email protected]
    "AAADFR101", "01:10:00.000Z", "11:10:00.000Z"
    "CCCUFE103", "03:00:00.000Z", "03:30:00.000Z"

In some cases this approach is recommended because it would nudge you in the direction of removing the redundant data (UserEmail) and it may lower your costs because it scans less data while querying the tables. But it seems to create a scalability issue. If you have 5 people using your app, then you would only have 5 tables to worry about. However, if your user base grows to 1000, 100K, or 1 Million, then you would have a crazy amount of tables to keep track of. So I am not sure if this is the best approach.

Alternative #3 I tried to add some indexes for the attributes that I thought were important, but it didn't work as I had expected and I still had the same original problem.

Alternative #4 Change the hash key from the SessionID to the UserEmail, and then use the SessionID as the range key. But it seems that this may not work under certain circumstances, such as when these values overlap, have different attribute fields, or if the attributes have the same value. For example, if John Doe had another session with the ID "AAADFR101", it would overwrite the existing entry instead of creating 2 entries with similar information. You can overcome this challenge by adding a new attribute (RecordID) and using that as the primary key (hash value). However, you are then back to the original problem when you try to query.
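The behavior described in Alternative #4 can be sketched with nested maps. Again, this is a plain-Java model rather than the AWS SDK, and the class, method names, and example.com addresses are hypothetical. It shows that an item is only overwritten when both the hash key (UserEmail) and the range key (SessionID) match, and that a guarded write can refuse to overwrite.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a hash/range table (UserEmail / SessionID). NOT the AWS SDK;
// every class and method name here is hypothetical.
final class HashRangeTable {

    // hash key (UserEmail) -> range key (SessionID), kept sorted -> item attributes
    private final Map<String, TreeMap<String, String>> partitions = new HashMap<>();

    // Unconditional put: replaces the item only if the same (hash, range) pair exists.
    // Returns the attributes that were replaced, or null if nothing was overwritten.
    String putItem(String userEmail, String sessionId, String attributes) {
        return partitions.computeIfAbsent(userEmail, k -> new TreeMap<>())
                .put(sessionId, attributes);
    }

    // Guarded put: stores the item only if the (hash, range) pair is unused.
    // Returns true if the item was stored, false if an existing item was left alone.
    boolean putItemIfAbsent(String userEmail, String sessionId, String attributes) {
        return partitions.computeIfAbsent(userEmail, k -> new TreeMap<>())
                .putIfAbsent(sessionId, attributes) == null;
    }

    // Query by hash key: all of one user's session IDs, in range-key order.
    List<String> querySessionIds(String userEmail) {
        TreeMap<String, String> partition = partitions.get(userEmail);
        return partition == null ? new ArrayList<String>() : new ArrayList<>(partition.keySet());
    }
}
```

The same SessionID under two different users does not collide here, because the full key is the pair. DynamoDB also supports conditional writes, which play the role of putItemIfAbsent in this sketch, so a duplicate key can be rejected rather than silently replaced.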

Ugh... My head is starting to hurt. Can someone point me in the right direction?

Thanks in advance,

-Randy

I don't understand the issue you have with Alternative 4. SessionId is supposed to be unique globally, right? That's why in the old schema you use sessionId as the hash key. Let me know if I am making sense... — Erben Mo
Alternative 4 looks good. Using customerId as hashkey is very common. Also you need to make sure your sessionId is unique (use random id generator?). — Erben Mo
Thanks for the comment, Erben. There are two challenges that I have with solution #4. — Randy B.
Problem A: If you use a hash key or hash/range key, then you can only have 1 occurrence of the key combination. Think of a web server. If you log every userId and sessionId into DynamoDB, you are fine until you have a repeat of the same hash/range key. Since sessionIds are usually randomly generated, it is a real possibility to have a duplicate. And if you do, it would overwrite the existing entry instead of inserting a new one. This isn't the desired behavior and is really bad from a security perspective because the data you have would be missing information. — Randy B.
Problem B: If you use hash/range key (loginEmail/sessionId), then you cannot query on date/time or any other attribute. Arosenber mentioned that you may get around this limitation by using a global index. But this causes a complete copy of the table (for each index). This would be costly and inefficient for large data sets. Not to mention that some clients may not be powerful enough to handle that much data. — Randy B.

arosenber · Accepted Answer · 2015-03-16T04:32:39

Welcome to DynamoDB, a great non-relational solution!

If all you want to add is the ability to search by email, you can do this with a global secondary index; see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html. This effectively has Amazon maintain a second table for you (so think double the costs), but it lets you query by either key. Since you mention scaling concerns, if you're worried about the distribution of your emails you can MD5 hash them first. You can consider using start or stop time as the index's range key.
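As a rough picture of what "a second table maintained for you" means, here is a plain-Java sketch (not the AWS SDK; the class, method names, and example.com addresses are hypothetical). The base table stays keyed by SessionID, and every write also updates a second structure keyed by UserEmail, with results ordered by StartTime as the index's range key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a base table plus an e-mail index. NOT the AWS SDK;
// every class and method name here is hypothetical.
final class SessionsWithEmailIndex {

    static final class Session {
        final String sessionId;
        final String startTime;
        final String stopTime;
        final String userEmail;

        Session(String sessionId, String startTime, String stopTime, String userEmail) {
            this.sessionId = sessionId;
            this.startTime = startTime;
            this.stopTime = stopTime;
            this.userEmail = userEmail;
        }
    }

    // Base table: hash key is SessionID.
    private final Map<String, Session> table = new HashMap<>();
    // Index: hash key is UserEmail. Index keys need not be unique, so each key holds a list.
    private final Map<String, List<Session>> emailIndex = new HashMap<>();

    // Every write goes to the base table and is mirrored into the index.
    void putItem(Session session) {
        Session replaced = table.put(session.sessionId, session);
        if (replaced != null) {
            emailIndex.get(replaced.userEmail).remove(replaced);
        }
        emailIndex.computeIfAbsent(session.userEmail, k -> new ArrayList<>()).add(session);
    }

    // Lookup by the base table's hash key.
    Session getItem(String sessionId) {
        return table.get(sessionId);
    }

    // Query the index by UserEmail; results are ordered by StartTime (the index range key).
    List<Session> queryByEmail(String userEmail) {
        List<Session> result = new ArrayList<>(
                emailIndex.getOrDefault(userEmail, Collections.<Session>emptyList()));
        result.sort(Comparator.comparing((Session s) -> s.startTime));
        return result;
    }
}
```

The sketch updates the index synchronously for simplicity. A real global secondary index is updated asynchronously, so index queries are eventually consistent, and the extra storage and write work is where the additional cost comes from.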

You're not going to be able to sort all your data by an attribute. If this is needed you'll need to look for another solution.
