2 votes

I'm exploring the use of DynamoDB in the application I work on, which currently only has one database component -- a MySQL database running on RDS.

We use AWS pretty heavily and have a sharding scheme for our databases, but it can only get us so far without manual intervention. Playing around with Aurora, I actually saw a significant drop in performance compared to our MySQL database, so I'm evaluating DynamoDB to see if it will work for us, since it can efficiently store JSON data and also scale easily (just increase the reads or writes per second in the AWS console and let Amazon do the heavy lifting).

In several of our MySQL tables we have a primary key that is an auto-increment column, but we also have several indices on top of that to support other query patterns. The other indices are crucial, as some of our tables have over 1 billion rows in them. In essence, we partition things by client, object_name, etc. So I might do something like this in MySQL:

Create Table: CREATE TABLE `record` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `client_id` int(10) unsigned NOT NULL,
  `data_id_start` bigint(20) unsigned NOT NULL,
  `data_id_end` bigint(20) unsigned NOT NULL DEFAULT '8888888888888888',
  `object_name` varchar(255) NOT NULL,
  `uuid` varchar(255) NOT NULL,
  `deleted` tinyint(1) unsigned NOT NULL DEFAULT '0',
  ...
  PRIMARY KEY (`id`),
  ...
  KEY `client_id_object_name_data_id_data_id_end_deleted` (`client_id`,`object_name`,`data_id_start`,`data_id_end`,`deleted`),
  KEY `client_id_object_name_data_id_end_uuid_id` (`client_id`,`object_name`,`data_id_end`,`uuid_id`),
  ...
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

I'm evaluating duplicating some of this data into DynamoDB to use as a cache, so in certain situations we don't have to go out to S3 to retrieve the data stored there. Instead, I'll just store the JSON data directly in the cache. In DynamoDB, it looks like I could use a HASH key, or a HASH and RANGE key together, as the primary key. So, for example, I could use the auto-increment column from our MySQL table as the HASH, but all of the examples I see of RANGE keys, global/local secondary indices, etc. only specify ONE other attribute as the RANGE. I want to create an index for efficient lookups when 3 or more values are specified in the "where" clause.

For example, I would like to query this table using an expression like this:

var params = {
    TableName: "Cache",
    KeyConditionExpression: "clientId = :clientId and objectName = :objectName and uuid = :uuid",
    ExpressionAttributeValues: {
        ":clientId": 17,
        ":objectName": "Some name",
        ":uuid": "ABC123-KDJFK3244-CCB"
    }
};

Notice that my "where clause" in the KeyConditionExpression uses 3 values. It's possible that we might have 4 or 5 values there. So is there any way in DynamoDB to create composite keys that have more than 2 attributes (columns) in them?

If not, I suppose that I could concatenate the 3 columns into a string and use that as my primary key on each insert. Or at least concatenate clientId and objectName, then use uuid as a RANGE key, or something like that. Effectively I need to page through all values for a specific clientId/objectName combination and then, based on some of the attributes in each row, either take its value directly from the cache or consider it a miss and retrieve the value from S3 (which is considerably slower).
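
To make that concrete, here is a rough sketch of the concatenation idea using the DocumentClient from the AWS SDK for Node.js (the "Cache" table and the "compositeKey" attribute are just placeholder names I'm making up for illustration):

var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

// Concatenate the values that would otherwise be separate "where"
// conditions into a single HASH key. The delimiter just has to be a
// character that can't appear in the individual values.
function makeCompositeKey(clientId, objectName) {
    return clientId + "#" + objectName;
}

var params = {
    TableName: "Cache", // placeholder table name
    // compositeKey would be the HASH key, uuid the RANGE key
    KeyConditionExpression: "compositeKey = :ck and #u = :uuid",
    ExpressionAttributeNames: { "#u": "uuid" }, // alias in case "uuid" is a reserved word
    ExpressionAttributeValues: {
        ":ck": makeCompositeKey(17, "Some name"),
        ":uuid": "ABC123-KDJFK3244-CCB"
    }
};

docClient.query(params, function (err, data) {
    if (err) console.error(err);
    else console.log(data.Items);
});

Querying with only the :ck condition would then let me page through every item for a given clientId/objectName combination.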


1 Answer

1 vote

DynamoDB supports consistent, low-latency queries over an essentially unlimited amount of data, so it is a good fit here. The model you suggested, concatenating the values into a single key, seems like a good approach.

One thing to note is that hash key attribute values are limited to 2048 bytes. If the values you are concatenating don't have predictable lengths (so you can't pad them nicely) or the result can exceed this limit, it may be better to hash the concatenated value and look items up by that hash instead. Here is the relevant documentation on limits: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html. DynamoDB items are also limited to 400 KB of total data.
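
As a rough sketch (assuming Node.js; the delimiter and the choice of SHA-256 are just examples), hashing the concatenated value could look like this:

var crypto = require("crypto");

// Hash the concatenated values so the partition key is a fixed,
// small size (a 64-character hex digest) no matter how long the
// individual values are.
function hashKey(clientId, objectName, uuid) {
    var raw = clientId + "#" + objectName + "#" + uuid;
    return crypto.createHash("sha256").update(raw).digest("hex");
}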

For correctness, I would also use a unique identifier as the range key. That way, collisions in the hash values (even if they're rare) are still handled correctly, and the schema scales well because there are only a small number of items per hash key value.
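
A sketch of what the table definition might look like with that layout (the table and attribute names are placeholders, and the throughput numbers are arbitrary):

var AWS = require("aws-sdk");
var dynamodb = new AWS.DynamoDB();

var params = {
    TableName: "Cache", // placeholder name
    AttributeDefinitions: [
        { AttributeName: "compositeKey", AttributeType: "S" }, // hash of the concatenated values
        { AttributeName: "uuid",         AttributeType: "S" }  // unique identifier as the range key
    ],
    KeySchema: [
        { AttributeName: "compositeKey", KeyType: "HASH" },
        { AttributeName: "uuid",         KeyType: "RANGE" }
    ],
    ProvisionedThroughput: { ReadCapacityUnits: 10, WriteCapacityUnits: 10 }
};

dynamodb.createTable(params, function (err, data) {
    if (err) console.error(err);
    else console.log("Created table:", data.TableDescription.TableName);
});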