3
votes

I am using Amazon DynamoDB to store event based data for activity streams.

I automatically create a new table for each month and store each event in the relevant month's table. This way I can quickly prune old months when needed simply by deleting the old table, and I can provision more throughput for the most recent tables.

However, from reading the Amazon docs I can see that the choice of hash key is very important.

Provisioned throughput is dependent on the primary key selection, and the workload patterns on individual items. When storing data, Amazon DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based on the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions.

I am having a hard time getting my head around this.

With the above in mind, my question is: which of these two hash keys would be better?

1382465533_john.doe

or:

john.doe_1382465533

The above keys are a composite of the userid and the timestamp of the event.

How these tables will be queried...

These tables will NOT have a range key, as it is not required for this use case.

This data will be used to construct activity feeds for users.

When an event occurs, the individual activity id is pushed (fanned out) into the Redis lists of the user's followers (one list for each user).
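A minimal sketch of that fan-out step with redis-py (the followers:<userid> set and feed:<userid> list key names are assumptions for illustration, not from the question):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def fan_out_activity(actor_id, activity_id):
        """Push an activity id onto each follower's Redis feed list."""
        # followers:<userid> is assumed to hold the actor's follower ids
        for follower_id in r.smembers(f"followers:{actor_id}"):
            follower_id = follower_id.decode()
            # feed:<userid> is the per-user activity list, newest first
            r.lpush(f"feed:{follower_id}", activity_id)
            # cap each feed at the most recent 1000 activity ids
            r.ltrim(f"feed:{follower_id}", 0, 999)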

Therefore when a user requests their stream we do the following:

  1. Get the list of activity ids from Redis.
  2. Loop through the activity ids and construct a BatchGetItem request to pull the items out of DynamoDB (sketched below).
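
A rough sketch of those two steps, assuming redis-py and boto3, and assuming the monthly table is named activities_2013_10 with activity_id as its hash key (both names are illustrative, not from the question):

    import boto3
    import redis

    r = redis.Redis(host="localhost", port=6379)
    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

    def get_activity_feed(user_id, limit=50):
        # 1. Get the list of activity ids from Redis
        activity_ids = [a.decode() for a in r.lrange(f"feed:{user_id}", 0, limit - 1)]
        if not activity_ids:
            return []

        # 2. Batch-get the activity items
        #    (BatchGetItem accepts at most 100 keys per call)
        response = dynamodb.batch_get_item(
            RequestItems={
                "activities_2013_10": {
                    "Keys": [{"activity_id": aid} for aid in activity_ids]
                }
            }
        )
        return response["Responses"].get("activities_2013_10", [])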

With all that in mind, what I need to understand is how best to define my hash key in the activity tables: timestamp first or userid first? And what logic does DynamoDB use to automatically partition on the hash key?

Thanks in advance for any advice.


1 Answer

4
votes

As per your question, I'd say that it doesn't matter how you compose your hash key, since you will have to query your table with the exact value for that hash key, and DynamoDB will treat it as a string anyway. It would be a different matter if you were composing a range key; then you'd probably want to compose it as follows:

john.doe_1382465533

so you can easily query your table like this:

hash key = whatever, range key >= john.doe_1382460000
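
With boto3, for example, that condition would look roughly like this (the table name and the feed_id / user_ts attribute names are made up for illustration):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("activities_2013_10")

    response = table.query(
        KeyConditionExpression=Key("feed_id").eq("whatever")
        & Key("user_ts").gte("john.doe_1382460000")
    )
    items = response["Items"]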

That said, maybe you can get rid of your Redis activity feed by integrating it directly into DynamoDB like this:

hash key: user id

range key: timestamp

the rest of the activity data

So instead of pushing the activity into DynamoDB and the activity id into Redis, you would only have to push to and query from the same DynamoDB table. I don't know if this would be compatible with the rest of your application, but it's an idea.
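
One way to read that suggestion, sketched with boto3 (the activity_feed table name and attribute names are assumptions): fan each activity out into the feed owner's partition on write, then read a feed back with a single Query, newest first:

    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("activity_feed")

    def push_activity(feed_owner_id, activity):
        """Write one activity item into a user's feed partition."""
        table.put_item(
            Item={
                "userid": feed_owner_id,        # hash key
                "timestamp": int(time.time()),  # range key
                **activity,                     # the rest of the activity data
            }
        )

    def get_feed(feed_owner_id, limit=50):
        """Read a user's feed, newest events first."""
        response = table.query(
            KeyConditionExpression=Key("userid").eq(feed_owner_id),
            ScanIndexForward=False,  # descending order on the range key
            Limit=limit,
        )
        return response["Items"]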