1 vote

I am designing DynamoDB tables. In one table, say test_table, I have defined a composite primary key: a partition key, say partition_id, and a sort key, say sort_id. Both partition_id and sort_id may have duplicate values individually, but the combination of partition_id and sort_id together will always be unique.

I would like to know: how efficient is retrieving the total count of records having some partition_id, say partition_id = x?


3 Answers

5 votes

The answer to your actual question is...very inefficient.

Unlike an RDBMS, DDB doesn't have COUNT() or any other aggregate functions, so the only way to get a count is to Query() or Scan() your data, returning it to your application and counting the rows yourself. (Query() does support Select='COUNT', which avoids transferring the items themselves, but DynamoDB still reads every matching item, so you pay read capacity proportional to the size of the partition.)
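As a minimal sketch of the "inefficient" approach, here is what counting a partition looks like with boto3's Query and Select='COUNT'. The function name and the injectable client parameter are my own choices; the table and attribute names come from the question. Note the pagination loop: a single Query reads at most 1 MB of data, so large partitions require repeated calls.

```python
def count_partition_items(table_name, partition_id, client=None):
    """Count all items for one partition key by paginating Query.

    Select='COUNT' means no item data is returned, but DynamoDB still
    reads every matching item, so read capacity consumed grows with
    the size of the partition.
    """
    if client is None:
        import boto3  # real AWS client; a fake can be injected for testing
        client = boto3.client("dynamodb")
    total = 0
    kwargs = {
        "TableName": table_name,
        "KeyConditionExpression": "partition_id = :p",
        "ExpressionAttributeValues": {":p": {"S": partition_id}},
        "Select": "COUNT",
    }
    while True:
        resp = client.query(**kwargs)
        total += resp["Count"]
        last = resp.get("LastEvaluatedKey")
        if not last:
            return total
        # More than 1 MB of data was scanned; continue from where we stopped.
        kwargs["ExclusiveStartKey"] = last
```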

Best practice, if you need aggregates, is to enable DDB Streams and attach a Lambda to them that calculates and stores the aggregate (the count) you're interested in.

So in the case you ask about,

partition_id = x

You could add another item to your table, (pk=x, sk="COUNT"), and the Lambda would update that item every time a record with pk=x is inserted or deleted.
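A sketch of such a stream handler, assuming a stream is enabled on test_table and the counter lives at (partition_id, sort_id="COUNT") as described above; the attribute name item_count and the injectable client are my own assumptions:

```python
def handler(event, context=None, client=None):
    """DynamoDB Streams handler (sketch): maintain a per-partition COUNT item."""
    if client is None:
        import boto3  # real AWS client; a fake can be injected for testing
        client = boto3.client("dynamodb")
    for record in event.get("Records", []):
        name = record["eventName"]  # INSERT | MODIFY | REMOVE
        if name == "MODIFY":
            continue  # item changed in place; the count is unaffected
        keys = record["dynamodb"]["Keys"]
        if keys["sort_id"]["S"] == "COUNT":
            continue  # don't count the counter item itself
        delta = 1 if name == "INSERT" else -1
        client.update_item(
            TableName="test_table",
            Key={
                "partition_id": keys["partition_id"],
                "sort_id": {"S": "COUNT"},
            },
            # ADD is atomic, so concurrent inserts/deletes don't lose updates.
            UpdateExpression="ADD item_count :d",
            ExpressionAttributeValues={":d": {"N": str(delta)}},
        )
```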

Then you can efficiently get the count with a simple GetItem(pk=x, sk="COUNT").
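Reading the counter back is then a single GetItem, consuming one read capacity unit regardless of partition size. Again the function name and item_count attribute are assumptions matching the handler sketch above:

```python
def get_count(partition_id, client=None):
    """Fetch the maintained count for one partition with a single GetItem."""
    if client is None:
        import boto3  # real AWS client; a fake can be injected for testing
        client = boto3.client("dynamodb")
    resp = client.get_item(
        TableName="test_table",
        Key={
            "partition_id": {"S": partition_id},
            "sort_id": {"S": "COUNT"},
        },
    )
    item = resp.get("Item")
    # No counter item yet means nothing has been inserted for this partition.
    return int(item["item_count"]["N"]) if item else 0
```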

More info in the docs Using Global Secondary Indexes for Materialized Aggregation Queries

1 vote

EDIT: Sorry, I misinterpreted the question and was considering the efficiency of returning records rather than counting them. Returning the count of records has the same efficiency as a query, and therefore consumes read capacity units in proportion to the total number of records read. This is inefficient compared to a solution that aggregates records and can return the aggregate with a GetItem call (consuming 1 read capacity unit), as described in the other answer.

The article Partitions and Data Distribution provides great detail on the efficiency of querying by the partition key.

Retrieving records by the Partition Key is always the most efficient way to retrieve records. Best Practices for Querying and Scanning Data reviews the efficiency of a query (which uses the Partition Key) versus a scan (which filters on any other field in the table).

Finally, with a large number of records that share the same partition key, you will consume more read capacity units, so a GetItem is more efficient than a Query. This answer goes into great detail about the cost of a query for multiple records sharing the same partition key and contains more links that may be useful to you. Counting these items becomes more inefficient as more items are added.

1 vote

DynamoDB is a NoSQL database designed for scalable, efficient lookup operations, not for analytical operations. Unlike an RDBMS, which can serve both OLTP and OLAP use cases, DynamoDB does not try to support both access patterns.

If you want to run analytics on the data that you have in DynamoDB, you need to replicate the data to S3 and run your analytics on that data using Amazon Athena. You can replicate the data either via DynamoDB Streams -> Lambda/Kinesis Firehose -> S3, or through an AWS Glue job (see an example: https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/)
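Once the data is in S3 and registered in the Glue catalog, an aggregate like the per-partition count becomes a single SQL query. A sketch, assuming a hypothetical catalog database and exported table (the names here are placeholders, not anything the export creates for you):

```python
def count_by_partition(database, table, output_s3, client=None):
    """Kick off an Athena aggregation over DynamoDB data exported to S3.

    `database` and `table` are hypothetical Glue catalog names for the
    export; `output_s3` is where Athena writes its result files.
    Returns the query execution id; results are fetched separately once
    the query completes.
    """
    if client is None:
        import boto3  # real AWS client; a fake can be injected for testing
        client = boto3.client("athena")
    sql = (
        f"SELECT partition_id, COUNT(*) AS n "
        f"FROM {table} GROUP BY partition_id"
    )
    resp = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Note that Athena queries run asynchronously, so this is suited to periodic reporting, not the low-latency per-request count that the Streams + Lambda counter from the accepted answer provides.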