27
votes

I want to know how many items are in my dynamodb table. From the API guide, one way to do it is using a scan as follows:

<?php
$dynamodb = new AmazonDynamoDB();

$scan_response = $dynamodb->scan(array(
    'TableName' => 'ProductCatalog' 
));

echo "Total number of items: ".count($scan_response->body->Items)."\n";

However, this has to fetch all items and store them in an array in memory which isn't feasible in most cases I would presume. Is there a way to get the total item count more efficiently?

This data is not available in the AWS Dynamo web-console, I have already checked. (at first it looks like it is shown alongside the pagination buttons, but it turns out the figure grows as you go to the next page of items).

8
Use DescribeTable, details here: stackoverflow.com/a/37036989/3305145Tyler

8 Answers

26
votes

I can think of three options to get the total number of items in a DynamoDB table.

  1. The first option is using the scan, but the scan function is inefficient and is in general a bad practice, especially for tables with heavy reads or production tables.

  2. The second option is what was mention by Atharva:

    A better solution that comes to my mind is to maintain the total number of item counts for such tables in a separate table, where each item will have Table name as it's hash key and total number of items in that table as it's non-key attribute. You can then keep this Table possibly named "TotalNumberOfItemsPerTable" updated by making atomic update operations to increment/decrement the total item count for a particular table.

    The only problem this is that increment operations are not idempotent. So if a write fails or you write more than once this will be reflected in the count. If you need pin-point accuracy, use a conditional update instead.

  3. The simplest solution is the DescribeTable which returns ItemCount. The only issue is that the count isn't up to date. The count is updated every 6 hours.

http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DescribeTable.html

13
votes

The Count option is definitely what you want, but you also have to take into account that there may be one or more "page" of results in your Scan result. The Scan operation only scans 1MB of data in your table at a time, so the value of Count in the result is only going to reflect the count of the first 1MB of the table. You will need to make subsequent requests using the value of LastEvaluatedKey in the result (if it is there). Here is some sample code for doing something like that:

<?php

$dynamo_db = new AmazonDynamoDB();

$total = 0;
$start_key = null;
$params = array(
    'TableName' => 'my-table',
    'Count'     => true
);

do {
    if ($start_key) {
        $params['ExclusiveStartKey'] = $start_key->getArrayCopy();
    }

    $response = $dynamo_db->scan($params);

    if ($response->isOK()) {
        $total += (string) $response->body->Count;

        if ($response->body->LastEvaluatedKey) {
            $start_key = $response->body->LastEvaluatedKey->to_array();
        } else {
            $start_key = null;
        }
    }
} while ($start_key);

echo "Count: {$total}";
7
votes

Aha, there is a Count option in the scan API, see http://docs.amazonwebservices.com/AWSSDKforPHP/latest/#m=AmazonDynamoDB/scan

<?php
$dynamodb = new DynamoMetadata();

$scan_response = $dynamodb->scan(array(
    'TableName' => 'ProductCatalog'
    'Count'     => true,
));

echo "Count: ".$scan_response->body->Count."\n";
5
votes

If you are interested in using the total number of items in a table in your application's logic, that means you are going to query for the total counts pretty frequently. Now one way to achieve this is by using scan operation. But remember that scan operation literally scans through the whole table and therefore consumes lots of throughput, so all the query operations will receive Throttled Exception in that duration. And even considering the fact that scan will limit the resultant count by size of 1MB, you will have to make repeated scan operations to get the actual number of items if the table is very large. This will require to write a custom query logic and handle inevitable throttling in query operations.

A better solution that comes to my mind is to maintain the total number of item counts for such tables in a separate table, where each item will have Table name as it's hash key and total number of items in that table as it's non-key attribute. You can then keep this Table possibly named "TotalNumberOfItemsPerTable" updated by making atomic update operations to increment/decrement the total item count for a particular table.

No issue of throttling or 1MB limit.

Furthermore, you can expand this concept to even further granularity for example to maintain total number of items matching with some hash key or any arbitrary criteria which you can encode in string form to make an entry in your table named something like "TotalNumberOfItemsInSomeCollection" or "TotalNumberOfItemsMatchingSomeCriteria". These tables can then contain entries for number of items per table, per collection or items matching with some criteria.

2
votes

An approximate item count value (supposedly updated every six hours) is available in the AWS console for DynamoDB. Just select the table and look under the Details tab, last entry is Item Count. If this works for you, then you can avoid consuming your table throughput to do the count.

1
votes

This is now available in the AWS table overview screen under the section 'Table details', field 'Item count'. It appears to be just a dump of DescribeTable, and notes that its updated roughly every six hours.

0
votes

Here's how I get the exact item count on my billion records DynamoDB table:

hive>

set dynamodb.throughput.write.percent = 1;
set dynamodb.throughput.read.percent = 1;
set hive.execution.engine = mr;
set mapreduce.reduce.speculative=false;
set mapreduce.map.speculative=false;

CREATE EXTERNAL TABLE dynamodb_table (`ID` STRING,`DateTime` STRING,`ReportedbyName` STRING,`ReportedbySurName` STRING,`Company` STRING,`Position` STRING,`Country` STRING,`MailDomain` STRING) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" = "BillionData", "dynamodb.column.mapping" = "ID:ID,DateTime:DateTime,ReportedbyName:ReportedbyName,ReportedbySurName:ReportedbySurName,Company:Company,Position:Position,Country:Country,MailDomain:MailDomain");

SELECT count(*) FROM dynamodb_table;

*You should have a EMR cluster, which comes installed with Hive and DynamoDB record Handler. *With this command, DynamoDB handler on the hive issues "PARALLEL SCANS" with multiple Mapreduce mappers(AKA Workers) working on different partitions to get the count. This will be much efficient and faster than normal scans.
*You must be willing to bump up Read capacity very high for certain period of time. * On a decent sized(20 node) cluster , With 10000 RCU , it took 15 minutes to get count on billion records Approx.
* New writes on this DDB table during this period will make the count inconsistent.

-1
votes

Please do not use scan method of dynamoDb because it read all data of tabel but in this case u need only count so use this .

$scan_response = $dynamodb->describeTable(array(
'TableName' => 'ProductCatalog' ));

and print this for result $scan_response['Table']['ItemCount']