I am using Amazon DynamoDB to store event-based data for activity streams.
I automatically create a new table for each month and store each event in the relevant month's table. This way I can quickly prune old months when needed simply by deleting their tables, and I can provision more throughput for the recent, hotter tables.
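As a concrete sketch of the table-per-month scheme (the `activity_YYYYMM` naming is my assumption, nothing is fixed yet):

```python
from datetime import datetime, timezone

def table_for(ts: int, prefix: str = "activity") -> str:
    """Return the monthly table name for a Unix timestamp, e.g. 'activity_201310'."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{prefix}_{dt:%Y%m}"

# Pruning an old month is then a single DeleteTable call on table_for(old_ts).
```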
However, from reading the Amazon docs I can see that the choice of hash key is very important:
Provisioned throughput is dependent on the primary key selection, and the workload patterns on individual items. When storing data, Amazon DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based on the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions.
I am having a hard time getting my head around this.
Therefore my question, with the above in mind, is: which of these two hash keys would be better?
1382465533_john.doe
or:
john.doe_1382465533
The above keys are a composite of the userid and the timestamp of the event.
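For clarity, the two candidates differ only in the order in which the parts are concatenated; a helper producing either form might look like this (the function name is hypothetical):

```python
def activity_key(user_id: str, ts: int, timestamp_first: bool = False) -> str:
    """Compose the hash key from userid and event timestamp, in either order."""
    return f"{ts}_{user_id}" if timestamp_first else f"{user_id}_{ts}"
```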
How these tables will be queried:
These tables will NOT have a range key, as it is not required for this use case.
This data will be used to construct activity feeds for users.
When an event occurs, its activity id is pushed (fanned out) into each of the user's followers' Redis lists (one list per user).
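The fan-out step might be sketched as below, assuming redis-py and a `stream:<userid>` key convention (both assumptions on my part); the list cap is also illustrative:

```python
def fan_out(redis_client, follower_ids, activity_id, max_len=1000):
    """Push a new activity id onto every follower's feed list in one round trip."""
    pipe = redis_client.pipeline()
    for fid in follower_ids:
        key = f"stream:{fid}"
        pipe.lpush(key, activity_id)        # newest activity at the head
        pipe.ltrim(key, 0, max_len - 1)     # cap feed length (assumed policy)
    pipe.execute()
```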
Therefore when a user requests their stream we do the following:
- Get the list of activity ids from Redis
- Loop through the activity ids and construct BatchGetItem requests to pull them out of DynamoDB
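A sketch of that read path, assuming boto3's low-level DynamoDB client, a `stream:<userid>` Redis key, and an `activity_id` string hash key (all assumptions). BatchGetItem accepts at most 100 keys per request, so the ids are chunked first:

```python
def chunked(ids, size=100):
    """Split the id list into BatchGetItem-sized chunks (max 100 keys each)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def fetch_stream(redis_client, dynamodb, table, user_id, limit=50):
    """Read the precomputed feed: ids from Redis, then items via BatchGetItem."""
    ids = redis_client.lrange(f"stream:{user_id}", 0, limit - 1)
    items = []
    for batch in chunked(ids):
        resp = dynamodb.batch_get_item(
            RequestItems={table: {"Keys": [{"activity_id": {"S": i}} for i in batch]}}
        )
        items.extend(resp["Responses"][table])
        # A production version would also retry resp["UnprocessedKeys"].
    return items
```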
With all that in mind, what I need to understand is how best to define the hash key in the activity tables: timestamp first or userid first? And what logic does DynamoDB use to automatically partition data by hash key?
Thanks in advance for any advice.