DynamoDB pricing using DynamoDB Storage Backend for Titan

Question

I would like to get a good understanding of what would be the price (in terms of $) of using DynamoDB Titan backend. For this, I need to be able to understand when DynamoDB Titan backend does reads and writes. Right now I am pretty clueless.

Ideally I would like to run a test case that adds some vertices and edges, does a rather simple traversal, and then shows how many reads and writes were done. Any ideas of how I can achieve this? Possibly through metrics?

If it turns out I can't extract this information myself, I would very much appreciate a first brief explanation about when DynamoDB Titan backend performs reads and writes.

Alexander Patrikalakis · Accepted Answer · 2016-01-14T14:29:44

For all Titan backends, to understand and estimate the number of writes, we rely on estimating the number of columns for a given KCVStore. You can also measure the number of columns that get written using metrics when using the DynamoDB Storage Backend for Titan.

To enable metrics, enable the configuration options listed here. Specifically, enable lines 7-11. Note the max-queue-length configuration property. If the executor-queue-size metric hits max-queue-length for a particular tx.commit() call, then you know that the queue / storage.buffer-size were not large enough. Once the executor-queue-size metric peaks without reaching max-queue-length, you know you have captured all the columns being written in a tx.commit() call, so that will give you the number of columns being changed in a tx.commit(). You can look at UpdateItem metrics for edgestore and graphindex to understand the spread of columns between the two tables.

All Titan storage backends implement KCVStore, and the keys and columns have different meanings depending on the kind of store. There are two stores that get the bulk of writes, assuming you have not turned on user-defined transaction logs. They are edgestore and graphindex.

The edgestore KCVStore is always written to, regardless of whether you configure composite indexes. Each edge, together with all of its edge properties, is represented by two columns (unless you set the schema of that edge label to be unidirectional). In the direct column, the key is the out-vertex of the edge and the column is the in-vertex. In the reverse column, the key is the in-vertex and the column is the out-vertex. Each vertex is represented by at least one column for the VertexExists hidden property, one column for the vertex label (optional), and one column for each vertex property. The key of a vertex is the vertex id, and its columns correspond to vertex properties, hidden vertex properties, and labels.
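The counting rules above can be turned into a rough estimator. This is only a sketch: the function name and parameters are made up for illustration, and it assumes that a unidirectional edge costs exactly one column, which the answer implies but does not state outright.

```python
def edgestore_columns(vertices, labeled_vertices, vertex_properties,
                      edges, unidirectional_edges=0):
    """Estimate the number of columns written to edgestore.

    Assumptions (taken from the counting rules in the answer):
      * one VertexExists column per vertex
      * one label column per vertex that has a label
      * one column per vertex property
      * two columns per ordinary edge (direct and reverse)
      * one column per unidirectional edge (assumed)
    """
    vertex_cols = vertices + labeled_vertices + vertex_properties
    bidirectional_edges = edges - unidirectional_edges
    edge_cols = 2 * bidirectional_edges + unidirectional_edges
    return vertex_cols + edge_cols


# 100 labeled vertices with 3 properties each, plus 500 ordinary edges
print(edgestore_columns(vertices=100, labeled_vertices=100,
                        vertex_properties=300, edges=500))
```

Running the estimate against a small test graph and comparing it with the measured column count is a quick way to check whether the assumptions hold for your schema.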

The graphindex KCVStore will only be written to if you configure composite indexes in the Titan management system. You can index vertex and edge properties. For each pair of indexed value and edge/vertex that has that indexed value, there will be one column in the graphindex KCVStore. The key will be a combination of the index id and value, and the column will be the vertex/edge id.
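To make the key and column layout concrete, here is a toy in-memory model of a single composite index. It is not Titan code: the index id `7`, the function names, and the rule that elements without a value produce no column are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graphindex(index_id, elements):
    """Model graphindex rows for one composite index.

    elements: iterable of (element_id, indexed_value) pairs, where
    element_id is a vertex or edge id. Elements whose value is None
    are assumed not to be indexed.
    Returns a dict mapping key (index_id, value) -> set of element ids,
    where each element id stands for one column.
    """
    store = defaultdict(set)
    for element_id, value in elements:
        if value is not None:
            store[(index_id, value)].add(element_id)
    return store


def count_columns(store):
    """One column per (indexed value, element) pair."""
    return sum(len(columns) for columns in store.values())


index = build_graphindex(7, [(1, "alice"), (2, "bob"), (3, "alice"), (4, None)])
print(dict(index))
print(count_columns(index))
```

Two vertices sharing the value `"alice"` land under the same key as two separate columns, which is why popular indexed values produce wide rows.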

Now that you know how to count columns, you can use this knowledge to estimate the size and number of writes to edgestore and graphindex when using the DynamoDB Storage Backend for Titan. If you use the multiple-item data model for a KCVStore, you will get one item for each key-column pair. If you use the single-item data model for a KCVStore, you will get one item for all columns at a key (this is not necessarily true when graph partitioning is enabled, but that is a detail I will not discuss now). As long as each vertex property is less than 1 KB, and the sum of all edge properties for an edge is less than 1 KB, each column will cost 1 WCU to write when using the multiple-item data model for edgestore. Likewise, each column in graphindex will cost 1 WCU to write if you use the multiple-item data model.
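The difference between the two data models can be sketched as follows. The sketch relies on the DynamoDB rule that a write consumes 1 WCU per 1 KB of item size, rounded up. It ignores the bytes taken by key attributes and attribute names, and it treats a single-item write as one write of the whole item, so read it as a lower bound rather than an exact figure. The function names are hypothetical.

```python
import math


def wcu_for_item(size_bytes):
    """WCUs consumed by writing one item: 1 per started KB, minimum 1."""
    return max(1, math.ceil(size_bytes / 1024))


def wcu_multiple_item(column_sizes):
    """Multiple-item model: every column is its own DynamoDB item."""
    return sum(wcu_for_item(size) for size in column_sizes)


def wcu_single_item(column_sizes):
    """Single-item model: all columns at a key share one DynamoDB item."""
    return wcu_for_item(sum(column_sizes))


sizes = [200, 300, 1500]  # bytes per column at one key (made-up numbers)
print(wcu_multiple_item(sizes), wcu_single_item(sizes))
```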

Let's assume you did your estimation and you use the multiple-item data model throughout. Let's also assume you estimate that you will be writing 750 columns per second to edgestore and 750 columns per second to graphindex, and that you want to drive this load for a day. You can set the read capacity for both tables to 1, so you know each table will start off with one physical DynamoDB partition. In us-east-1, the cost for writes is $0.0065 per hour for every 10 units of write capacity, so 24 * 75 * $0.0065 is $11.70 per day for writes for each table. This means the write capacity would cost $23.40 per day for edgestore and graphindex together. The reads could be set to 1 read per second for each of the tables, making the read cost 2 * 24 * $0.0065 = $0.312 for both tables per day. If your AWS account is new, the reads would fall within the free tier, so effectively you would only be paying for the writes.
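The same arithmetic as a small script, so you can plug in your own column rates. The price and the 10-unit write block are the us-east-1 figures quoted in this answer and may have changed since; the function names are illustrative only. `Decimal` is used to avoid floating-point noise in the currency amounts.

```python
import math
from decimal import Decimal

PRICE_PER_BLOCK_HOUR = Decimal("0.0065")  # figure quoted in the answer
WRITE_UNITS_PER_BLOCK = 10                # writes are priced per 10 units


def daily_write_cost(columns_per_second, hours=24):
    """Cost of provisioning enough write capacity for one table."""
    blocks = math.ceil(columns_per_second / WRITE_UNITS_PER_BLOCK)
    return PRICE_PER_BLOCK_HOUR * blocks * hours


def daily_read_cost(tables, hours=24):
    """The answer bills one price block per table for 1 read unit."""
    return PRICE_PER_BLOCK_HOUR * tables * hours


per_table = daily_write_cost(750)
print("writes per table:", per_table)
print("writes, both tables:", 2 * per_table)
print("reads, both tables:", daily_read_cost(2))
```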

Another aspect of DynamoDB pricing is storage. If you write 750 columns per second, that is 64.8 million items per day to one table, or about 1.9 billion (approximately 2 billion) items per month. The average number of items in the table over the first month is then about 1 billion. If each item averages 412 bytes and there are 100 bytes of overhead per item, then 1 billion 512-byte items are stored for the month, which is approximately 477 GB. 477 / 25 rounded up is 20, so storage for the first month at this load would cost 20 * $0.25. If you keep adding items at this rate without deleting them, the monthly storage cost will increase by approximately 5 dollars per month.
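The volume part of this estimate can be reproduced with a few lines. The sketch assumes a 30-day month, linear growth from an empty table (so the average is half of the final item count), and gigabytes of 2^30 bytes, which is what yields the 477 figure. It stops at the stored volume and does not attempt the pricing step. Function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def items_written(columns_per_second, days):
    """Items written to one table under the multiple-item data model."""
    return columns_per_second * SECONDS_PER_DAY * days


def average_stored_gb(items_at_end, item_bytes, overhead_bytes=100):
    """Average volume stored while the table grows linearly from empty."""
    average_items = items_at_end / 2
    return average_items * (item_bytes + overhead_bytes) / 2**30


print(items_written(750, days=1))    # items per day
print(items_written(750, days=30))   # items per 30-day month
# The answer rounds the monthly total up to 2 billion items.
print(average_stored_gb(2_000_000_000, item_bytes=412))
```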

If you do not have super nodes in your graph, or vertices with a relatively large number of properties, then the writes to the edgestore will be distributed evenly throughout the partition key space. That means your table will split into 2 partitions when it hits 10 GB, then each of those partitions will split again, for a total of 4 partitions, when they hit 10 GB, and so on. The next power of 2 above 477 GB / (10 GB / partition) is 2^6 = 64, so your edgestore would split 6 times over the course of the first month. You would probably have around 64 partitions at the end of the first month.

Eventually, your table will have so many partitions that each partition will have very few IOPS. This phenomenon is called IOPS starvation. You should have a strategy in place to address IOPS starvation. Two commonly used strategies are 1. batch cleanup/archival of old data and 2. rolling (time-series) graphs.

In option 1, you spin up an EC2 instance to traverse the graph, write old data to a colder store (S3, Glacier, etc.), and delete it from DynamoDB.

In option 2, you direct writes to graphs that correspond to a time period (weeks - 2015W1, months - 2015M1, etc). As time passes, you reduce the provisioned write capacity on the older tables, and when the time comes to migrate them to colder storage, you read the entire graph for that time period and delete the corresponding DynamoDB tables. The advantage of this approach is that it allows you to manage your write provisioning cost with finer granularity, and it allows you to avoid the cost of deleting individual items (because you delete a table for free instead of incurring at least 1 WCU for every item you delete).
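A rough sketch of the split estimate and of the period naming used for rolling graphs. It assumes that partitions split evenly at 10 GB and that provisioned writes are divided equally among partitions, which is a simplification of the behavior described above. The function names are hypothetical, and the week name is based on the ISO week and ISO year.

```python
import math
from datetime import date


def splits_needed(table_gb, partition_limit_gb=10):
    """Number of doublings until every partition is under the size limit."""
    if table_gb <= partition_limit_gb:
        return 0
    return math.ceil(math.log2(table_gb / partition_limit_gb))


def writes_per_partition(provisioned_wcu, table_gb):
    """Provisioned writes available to each partition (assumed even split)."""
    return provisioned_wcu / 2 ** splits_needed(table_gb)


def graph_name_for(day, period):
    """Name of the rolling graph a write on `day` would go to."""
    if period == "week":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}W{iso_week}"
    if period == "month":
        return f"{day.year}M{day.month}"
    raise ValueError("period must be 'week' or 'month'")


print(splits_needed(477), 2 ** splits_needed(477))
print(writes_per_partition(750, 477))
print(graph_name_for(date(2015, 1, 1), "week"))
```

With 750 provisioned writes spread over 64 partitions, each partition gets under 12 writes per second, which shows why a hot key can be throttled long before the table-level capacity is used up.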
