13
votes

The docs (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/APISummary.html) state:

You can query only tables whose primary key is of hash-and-range type

and

we recommend that you design your applications so that you can use the Query operation mostly, and use Scan only where appropriate

It's not directly stated, but does this make it best practice to use hash-and-range primary keys?

EDIT:

Answer TL;DR: Use whichever primary key type makes sense for your data model, and use secondary indexes for better querying support.

References:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html

http://www.allthingsdistributed.com/2013/12/dynamodb-global-secondary-indexes.html

https://forums.aws.amazon.com/thread.jspa?messageID=604862

In what situation do you use Simple Hash Keys on DynamoDB?

3 Answers

15
votes

The choice of key comes down to your use cases and data requirements for a particular scenario. For example, if you are storing User Session Data, a Range Key might not make much sense, since each record can be referenced by a GUID and accessed directly, with no grouping requirements. Once you know the Session Id, you simply get the specific item by its key. Another example is User Account or Profile data: each user has their own record, and you will most likely access it directly (by User Id or some other identifier).

However, if you are storing Order Items then the Range Key makes much more sense since you probably want to retrieve the items grouped by their Order.

In terms of the Data Model, the Hash Key on its own uniquely identifies a record in a hash-only table. When a Range Key is added, it groups and sorts the records that share a Hash Key and are usually retrieved together, and the combination of the two keys identifies each record. Example: if you are defining an Aggregate to store Order Items, the Order Id could be your Hash Key and the OrderItemId the Range Key. Whenever you want to retrieve the Order Items of a particular Order, you just query by the Hash Key (Order Id) and you get all of its order items.

You can find below a formal definition for the use of these two keys:

"Composite Hash Key with Range Key allows the developer to create a primary key that is the composite of two attributes, a 'hash attribute' and a 'range attribute.' When querying against a composite key, the hash attribute needs to be uniquely matched but a range operation can be specified for the range attribute: e.g. all orders from Werner in the past 24 hours, or all games played by an individual player in the past 24 hours." [VOGELS]
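To make the "exact match on the hash attribute, range operation on the range attribute" idea concrete, here is a minimal in-memory sketch. It is a toy model, not DynamoDB's implementation: the class `CompositeKeyTable`, its methods, and the sample order data are all hypothetical names invented for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class CompositeKeyTable:
    """Toy in-memory model of a hash-and-range table (hypothetical, not DynamoDB)."""

    def __init__(self):
        # hash key -> list of (range key, item) pairs, kept sorted by range key
        self._partitions = defaultdict(list)

    def put(self, hash_key, range_key, item):
        rows = self._partitions[hash_key]
        keys = [rk for rk, _ in rows]
        i = bisect_left(keys, range_key)
        if i < len(rows) and rows[i][0] == range_key:
            rows[i] = (range_key, item)        # same full key: overwrite
        else:
            rows.insert(i, (range_key, item))  # keep rows sorted by range key

    def query(self, hash_key, low=None, high=None):
        """Exact match on the hash key, optional inclusive bounds on the range key."""
        rows = self._partitions.get(hash_key, [])
        keys = [rk for rk, _ in rows]
        start = 0 if low is None else bisect_left(keys, low)
        end = len(rows) if high is None else bisect_right(keys, high)
        return [item for _, item in rows[start:end]]


# Hash key = customer, range key = order timestamp (made-up values).
orders = CompositeKeyTable()
orders.put("werner", 1000, "order-a")
orders.put("werner", 5000, "order-b")
orders.put("werner", 9000, "order-c")
orders.put("maria", 5000, "order-d")

# "All orders from Werner since timestamp 4000"
recent = orders.query("werner", low=4000)
print(recent)
```

The query never looks at other customers' rows: the hash key selects one group, and the range bounds select a sorted slice within it.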

So the Range Key adds a grouping capability to the Data Model. However, the use of these two keys also has an implication for the Storage Model:

"Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution. A uniform key distribution can help us achieve uniform load distribution assuming the access distribution of keys is not highly skewed." [DDB-SOSP2007]

The Hash Key not only identifies the record but is also the mechanism that ensures load distribution. The Range Key (when used) indicates which records will mostly be retrieved together, so the storage can also be optimized for that need.
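The effect of the Hash Key on load distribution can be illustrated with a small consistent-hash ring. This is only a sketch of the general technique named in the quote, under assumed parameters: the node names, the use of MD5, and the 64 virtual nodes per node are illustrative choices, and DynamoDB's actual internal partitioning is not something this code claims to reproduce.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def _hash(value: str) -> int:
    # MD5 is used here only as a stable, well-mixed hash (an assumption).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed at several points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]


def load_by_node(ring, hash_keys):
    """Count how many requests land on each node for a stream of hash keys."""
    return Counter(ring.node_for(k) for k in hash_keys)


ring = HashRing(["node-a", "node-b", "node-c"])

# Many distinct hash keys (e.g. session ids) spread across the nodes.
spread = load_by_node(ring, [f"session-{n}" for n in range(3000)])

# One "hot" hash key sends every request to the same node.
skewed = load_by_node(ring, ["hot-key"] * 3000)

print(dict(spread))
print(dict(skewed))
```

A hash key with many distinct, evenly accessed values gives the ring something to spread; a single heavily accessed value cannot be spread no matter how many nodes exist.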

Choosing the correct keys to represent your data is one of the most critical aspects of your design process, and it directly impacts how well your application will perform and scale, and how much it will cost.


Footnotes:

  • The Data Model is the model through which we perceive and manipulate our data. It describes how we interact with the data in the database [FOWLER]. In other words, it is how you abstract your data: the way you group your entities, the attributes that you choose as primary keys, etc.

  • The Storage Model describes how the database stores and manipulates the data internally [FOWLER]. Although you cannot control this directly, you can certainly optimize how the data is retrieved or written by knowing how the database works internally.

5
votes

Not necessarily. It is best to choose a primary key that supports the access patterns for your use case.

For example, let's say you want to have a table for Users. You will store the details for a single user (name, email, creator, etc.). Your access pattern might be that you are fetching the details for a specific User. In this case it makes more sense to use a primary key of type hash, with a hash key of userId.

Let's say you also want another table that stores Groups. Your access pattern might be that you want to get all members for a given group. Here, it makes more sense to use a primary key of type hash and range, with your hash and range keys respectively being groupId and userId.

The important things to know are the differences between both key types (quote below) and the Guidelines for Working with Tables:

  • Hash Type Primary Key: The primary key is made of one attribute, a hash attribute. DynamoDB builds an unordered hash index on this primary key attribute. Each item in the table is uniquely identified by its hash key value.

  • Hash and Range Type Primary Key: The primary key is made of two attributes. The first attribute is the hash attribute and the second one is the range attribute. DynamoDB builds an unordered hash index on the hash primary key attribute, and a sorted range index on the range primary key attribute. Each item in the table is uniquely identified by the combination of its hash and range key values. It is possible for two items to have the same hash key value, but those two items must have different range key values.
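The uniqueness rules in the two definitions above can be sketched with plain Python dictionaries, using the Users and Groups tables from this answer. This is an illustrative model only: the attribute values are made up, and the helper `members_of` is a hypothetical stand-in for a Query on the hash key (it scans the dictionary, which a real sorted range index would not need to do).

```python
# Hash type primary key: one attribute (userId) identifies an item.
users = {}
users["u1"] = {"name": "Maria", "email": "maria@example.com"}
users["u2"] = {"name": "Augusto", "email": "augusto@example.com"}
users["u1"] = {"name": "Maria Lopez", "email": "maria@example.com"}  # same key: replaces

# Hash and range type primary key: (groupId, userId) together identify an item.
memberships = {}
memberships[("g1", "u1")] = {"role": "admin"}
memberships[("g1", "u2")] = {"role": "member"}  # same hash key, different range key: allowed
memberships[("g2", "u1")] = {"role": "member"}
memberships[("g1", "u2")] = {"role": "owner"}   # same hash AND range key: replaces


def members_of(group_id):
    """Hypothetical stand-in for querying by hash key: all userIds in one group, sorted."""
    return sorted(user for (group, user) in memberships if group == group_id)


print(users["u1"]["name"])
print(members_of("g1"))
```

Fetching one user needs only the hash key, while listing a group's members relies on several items sharing a hash key and differing in their range key.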

You can read more about best practices in the DynamoDB Guidelines for Working with Tables documentation.

2
votes

As others have already said: no, you should not.

The statement that confused you and prompted this question is wrong:

You can query only tables whose primary key is of hash-and-range type

You can also query tables whose primary key is of single-attribute (partition-only) type.

Proof:

# Create single-attribute primary key table
aws dynamodb create-table --table-name testdb6 --attribute-definitions '[{"AttributeName": "Id", "AttributeType": "S"}]' --key-schema '[{"AttributeName": "Id", "KeyType": "HASH"}]' --provisioned-throughput '{"ReadCapacityUnits": 5, "WriteCapacityUnits": 5}' 

# Populate table
aws dynamodb put-item --table-name testdb6 --item '{ "Id": {"S": "1"}, "LastName": {"S": "Lopez"}, "FirstName": {"S": "Maria"}}'
aws dynamodb put-item --table-name testdb6 --item '{ "Id": {"S": "2"}, "LastName": {"S": "Fernandez"}, "FirstName": {"S": "Augusto"}}'

# Query table using only partition attribute
aws dynamodb query --table-name testdb6 --select ALL_ATTRIBUTES --key-conditions '{"Id": {"AttributeValueList": [{"S": "1"}], "ComparisonOperator": "EQ"}}'

Output of the last command (it works):

{
    "Count": 1,
    "Items": [
        {
            "LastName": {
                "S": "Lopez"
            },
            "Id": {
                "S": "1"
            },
            "FirstName": {
                "S": "Maria"
            }
        }
    ],
    "ScannedCount": 1,
    "ConsumedCapacity": null
}
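The CLI query above uses `--key-conditions`, which the AWS documentation describes as a legacy parameter in favor of a key condition expression. As a hedged sketch, the same hash-only query could be expressed as keyword arguments for boto3's low-level `query` call. The helper name `build_query_kwargs` is hypothetical, the block only builds the request dictionary, and actually sending it assumes boto3 is installed and AWS credentials are configured.

```python
def build_query_kwargs(table_name, key_name, key_value):
    """Build Query parameters for an equality match on a string hash key.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        "TableName": table_name,
        "Select": "ALL_ATTRIBUTES",
        # Placeholders keep attribute names and values out of the expression text.
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": key_name},
        "ExpressionAttributeValues": {":pk": {"S": key_value}},
    }


kwargs = build_query_kwargs("testdb6", "Id", "1")
print(kwargs)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   response = boto3.client("dynamodb").query(**kwargs)
```

Only the hash key appears in the key condition, which is all a table with a single-attribute primary key requires.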