1
votes

I went through this article, which says that data records are organized into groups called shards, and that these shards can be consumed and processed in parallel by a Lambda function. I also found these slides from an AWS webinar, where slide 22 also shows Lambda functions consuming different shards in parallel. However, I could not achieve parallel execution of a single function. I created a simple Lambda function that runs for a minute, then started creating tons of items in DynamoDB, expecting to get a lot of stream records. In spite of this, my function was invoked one run after another, never in parallel.

What am I doing wrong?

2 Answers

0
votes

The first article says:

Because shards have a lineage (parent and children), applications must always process a parent shard before it processes a child shard. This will ensure that the stream records are also processed in the correct order.

Yet when working with Kinesis streams, for example, you can achieve parallelism by having multiple shards, because the order in which records are processed is guaranteed only within a shard.

As a side note, it makes some sense for DynamoDB events to trigger Lambda in order.
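To illustrate the parent-before-child rule quoted above, here is a minimal sketch that orders shards by lineage. It assumes each shard is a dict with a `ShardId` and an optional `ParentShardId`, which is the shape the stream description uses; the shard IDs and the `order_by_lineage` helper are made up for the example.

```python
def order_by_lineage(shards):
    """Return shard IDs so that every parent appears before its children."""
    by_id = {s["ShardId"]: s for s in shards}
    ordered, seen = [], set()

    def visit(shard_id):
        if shard_id in seen:
            return
        parent = by_id[shard_id].get("ParentShardId")
        # A parent may already have expired from the stream;
        # only visit it if it is still listed.
        if parent in by_id:
            visit(parent)
        seen.add(shard_id)
        ordered.append(shard_id)

    for s in shards:
        visit(s["ShardId"])
    return ordered


# Hypothetical shard list: one root shard that was split into two children.
shards = [
    {"ShardId": "shard-child-a", "ParentShardId": "shard-root"},
    {"ShardId": "shard-root"},
    {"ShardId": "shard-child-b", "ParentShardId": "shard-root"},
]
print(order_by_lineage(shards))
```

The two child shards have no ordering constraint between them, so they are the ones that could be processed in parallel.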

0
votes

Pre-Context:

How does DynamoDB store data?

DynamoDB uses partitions to store table records. These partitions are abstracted away from users and managed by the DynamoDB team. As the data in a table grows, the partitions are split further internally.

What are DynamoDB streams all about?

DynamoDB, as a database, provides a way for users to retrieve an ordered change log (think of it as the transactional replay log of a traditional database). This log is exposed as a DynamoDB table stream.

How is data published in streams?

A stream has a concept of shards (somewhat similar to partitions). A shard, by definition, contains ordered events. In DynamoDB terms, a stream shard contains the data from a certain partition.

Cool! So what happens if the data in the table grows or writes become frequent?

DynamoDB keeps persisting records, based on HashKey/SortKey, in the associated partition until a threshold is breached (such as table size and/or RCU/WCU counts). The exact values of these thresholds are not published by DynamoDB, though there is some documentation with rough estimates.

Once a threshold is breached, DynamoDB splits the partition and re-hashes the data to distribute it (somewhat) evenly across the new partitions.

Since new partitions now exist, their data is published to their own shards (each mapped to its partition).


Great, so what about Lambda? How does parallel processing work, then?

By default, one Lambda invocation at a time processes records from one and only one shard. Thus the number of shards present in the DynamoDB stream determines the number of Lambda invocations running in parallel.
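The shard-to-worker relationship can be sketched with a small local simulation. This is not how Lambda is implemented; it only mimics the rule described above, with a thread pool standing in for concurrent invocations and made-up shard IDs and records.

```python
from concurrent.futures import ThreadPoolExecutor


def process_shard(shard_id, records):
    # Records within one shard are handled strictly in order.
    return shard_id, [f"handled:{r}" for r in records]


def process_stream(shards):
    # One worker per shard: shards run in parallel,
    # records inside a single shard do not.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = [
            pool.submit(process_shard, shard_id, records)
            for shard_id, records in shards.items()
        ]
        return dict(f.result() for f in futures)


# Hypothetical stream with two shards.
example = {"shard-0": ["a1", "a2"], "shard-1": ["b1"]}
results = process_stream(example)
print(results)
```

With a single shard the pool has a single worker, which matches what the question describes: invocations run one after another no matter how many records arrive.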

Roughly, you can think of it as: # of partitions = # of shards = # of Lambda invocations running in parallel.
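To check how much parallelism to expect, you can count the open shards in the stream description. The sketch below assumes the response shape of the DynamoDB Streams `DescribeStream` call, where a shard that is still open has no `EndingSequenceNumber` in its `SequenceNumberRange`. The sample data and the `count_open_shards` helper are made up for illustration.

```python
def count_open_shards(stream_description):
    """Count shards that are still open (no EndingSequenceNumber yet)."""
    shards = stream_description.get("Shards", [])
    return sum(
        1
        for s in shards
        if "EndingSequenceNumber" not in s.get("SequenceNumberRange", {})
    )


# Hypothetical description: one closed parent shard and two open children.
# With boto3 the real value would come from something like
# boto3.client("dynamodbstreams").describe_stream(StreamArn=...)["StreamDescription"]
sample = {
    "Shards": [
        {
            "ShardId": "shard-root",
            "SequenceNumberRange": {
                "StartingSequenceNumber": "100",
                "EndingSequenceNumber": "199",
            },
        },
        {
            "ShardId": "shard-child-a",
            "ParentShardId": "shard-root",
            "SequenceNumberRange": {"StartingSequenceNumber": "200"},
        },
        {
            "ShardId": "shard-child-b",
            "ParentShardId": "shard-root",
            "SequenceNumberRange": {"StartingSequenceNumber": "201"},
        },
    ]
}
print(count_open_shards(sample))
```

If this returns 1 for your table's stream, seeing the function run sequentially is the expected behavior.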