
Assume we're using AWS triggers on a DynamoDB table, and that the trigger runs a Lambda function whose job is to update the corresponding entry in CloudSearch (to keep DynamoDB and CloudSearch in sync).

I'm not clear on how Lambda would always keep the data in CloudSearch in sync with the data in DynamoDB. Consider the following flow:

  1. The application updates a DynamoDB table's record A (say, to A1)
  2. Very shortly after that, the application updates the same record A again (to A2)
  3. The trigger for update 1 causes Lambda invocation 1 to start executing
  4. The trigger for update 2 causes Lambda invocation 2 to start executing
  5. Step 4 completes first, so CloudSearch sees A2
  6. Now step 3 completes, so CloudSearch sees A1

Lambda invocations are not guaranteed to start only after the previous invocation is complete (correct me if I'm wrong, and please provide a link).

As we can see, the data goes out of sync.
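
For concreteness, here is a minimal sketch (my assumption of the setup, not code from the question) of what each triggered Lambda would do: overwrite the record's document in CloudSearch with the new image. The domain endpoint, the "id" key, and the "title" field are hypothetical. Note there is no version check, so whichever invocation finishes last wins:

    import json
    import boto3

    # CloudSearch document-service client; the endpoint is domain-specific
    # and hypothetical here.
    cs = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-mydomain.us-east-1.cloudsearch.amazonaws.com",
    )

    def upsert(doc_id, fields):
        # A one-document SDF batch: a plain last-writer-wins overwrite.
        # If invocation order is not guaranteed, a stale image (A1) can
        # land after a newer one (A2).
        batch = [{"type": "add", "id": doc_id, "fields": fields}]
        cs.upload_documents(
            documents=json.dumps(batch),
            contentType="application/json",
        )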

The closest thing I can think of that would work is AWS Kinesis Streams, restricted to a single shard (with its 1 MB/s ingestion limit). Within that restriction, the consumer application can be written so that records are processed strictly sequentially, i.e., the next record is processed only after the previous record has been put into CloudSearch. Assuming that statement is true, how do we ensure the sync happens correctly when so much data is ingested into DynamoDB that more than one Kinesis shard is needed?
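
A minimal sketch of that single-shard sequential consumer, assuming a hypothetical stream name and an index_document helper standing in for the CloudSearch upload; a real consumer would also checkpoint SequenceNumber (e.g., via the Kinesis Client Library) so it can resume where it stopped:

    import time
    import boto3

    kinesis = boto3.client("kinesis")

    def index_document(data):
        # Hypothetical helper: upsert one change record into CloudSearch.
        print("indexing", data)

    def consume(stream_name="table-changes", shard_id="shardId-000000000000"):
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        while iterator:
            resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
            for record in resp["Records"]:
                # Strictly sequential: do not advance to the next record
                # until this one is safely indexed.
                index_document(record["Data"])
            iterator = resp["NextShardIterator"]
            time.sleep(1)  # stay under the shard's read-throughput limit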


2 Answers


You may achieve that using DynamoDB Streams:

DynamoDB Streams

"A DynamoDB stream is an ordered flow of information about changes to items in an Amazon DynamoDB table."

DynamoDB Streams guarantees the following:

  • Each stream record appears exactly once in the stream.
  • For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.

Another cool thing about DynamoDB Streams: if your Lambda fails to handle a stream record (any error when indexing into CloudSearch, for example), the event will keep retrying, and the remaining stream records will wait until your invocation succeeds.

We use Streams to keep our Elasticsearch indexes in sync with our DynamoDB tables.
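
As a rough sketch of that pattern (this assumes a stream view type that includes new images, a single string key named "id", and a hypothetical "title" field; it targets CloudSearch to match the question, but the same shape works for Elasticsearch):

    import json
    import boto3

    cs = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-mydomain.us-east-1.cloudsearch.amazonaws.com",
    )

    def handler(event, context):
        batch = []
        for record in event["Records"]:
            doc_id = record["dynamodb"]["Keys"]["id"]["S"]
            if record["eventName"] == "REMOVE":
                batch.append({"type": "delete", "id": doc_id})
            else:
                # INSERT and MODIFY events carry the item's new image.
                image = record["dynamodb"]["NewImage"]
                batch.append({
                    "type": "add",
                    "id": doc_id,
                    "fields": {"title": image["title"]["S"]},
                })
        # If this raises, Lambda retries the same batch before moving on
        # in the shard, which is the blocking/retry behaviour described above.
        cs.upload_documents(
            documents=json.dumps(batch),
            contentType="application/json",
        )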


AWS Lambda FAQ Link

Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?

The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.

So that means Lambda picks up the records in a shard one by one, in the order they appear in the shard, and does not process a new record until the previous record has been processed!

However, one problem remains: what if stream entries for the same record end up in different shards? Thankfully, DynamoDB Streams ensures that a given primary key always resides in a single active shard. (Essentially, I think, the primary key is hashed to pick the shard.) AWS Slide Link. See more from the AWS Blog below:

The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
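
To make that concrete, here is a small sketch (assuming a single string key named "id" and a stream view that includes new images). Because the records for one key arrive in modification order within a shard, a simple in-order fold over the batch ends at the item's latest state, so the A1-overwrites-A2 race from the question cannot happen:

    def latest_images(event):
        # Fold a stream batch down to the newest image per key.
        latest = {}
        for record in event["Records"]:
            doc_id = record["dynamodb"]["Keys"]["id"]["S"]
            # Records for the same key are in modification order, so the
            # last one seen wins (None after a REMOVE).
            latest[doc_id] = record["dynamodb"].get("NewImage")
        return latest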