2 votes

I am looking into replicating DynamoDB into Elasticsearch (ES). We evaluated the Logstash input plugin for this purpose, but found the following drawbacks:

  • Logstash in pull mode has no HA/failover features; it becomes a single point of failure (SPOF) for replication.
  • Since we do not want to do application-level joins across ES indexes, we want to merge multiple tables into one ES document. The plugin does not support this use case.

Hence, we are evaluating the following two approaches:

  1. Lambdas read the DynamoDB stream and push the records to ES via SQS (a minimal sketch of the Lambda half follows this list)
  2. Our own DynamoDB Streams processor in place of AWS Lambda
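
For approach #1, the Lambda half could look roughly like the sketch below. This is a minimal illustration, not a reference implementation: the queue URL is a placeholder, and it simply forwards each stream record to SQS as JSON. Note that a standard SQS queue does not itself preserve ordering, so this alone only moves the ordering problem downstream.

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/es-replication"  # placeholder

    def handler(event, context):
        # Lambda receives a batch of DynamoDB stream records per shard.
        # Records within one shard arrive in order; across shards,
        # invocations run concurrently.
        for record in event["Records"]:
            message = {
                "eventName": record["eventName"],  # INSERT / MODIFY / REMOVE
                "keys": record["dynamodb"]["Keys"],
                "newImage": record["dynamodb"].get("NewImage"),
                "sequenceNumber": record["dynamodb"]["SequenceNumber"],
            }
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))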

Now, coming to the actual problem: ordering matters when replicating data from DynamoDB Streams to ES, since there can be multiple mutations for the same entity. The Streams/Lambda documentation states that records in different stream shards are processed by Lambda concurrently.

AWS does not document (or at least I have not been able to locate) how DynamoDB mutations are mapped to stream shards - whether there is any correlation to the tables' hash keys, or whether it is some kind of bin-packing algorithm.
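
One way to probe this empirically is to walk the stream's shards with the low-level API and observe which keys land in which shard. A rough sketch, assuming the stream ARN is known (the truncated ARN is a placeholder); boto3's dynamodbstreams client exposes describe_stream, get_shard_iterator, and get_records:

    import boto3

    streams = boto3.client("dynamodbstreams")
    STREAM_ARN = "arn:aws:dynamodb:..."  # placeholder

    desc = streams.describe_stream(StreamArn=STREAM_ARN)["StreamDescription"]
    for shard in desc["Shards"]:
        # ParentShardId exposes the shard lineage: a child shard continues
        # where its parent left off.
        print(shard["ShardId"], "parent:", shard.get("ParentShardId"))

        iterator = streams.get_shard_iterator(
            StreamArn=STREAM_ARN,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]

        # Print each record's item key next to its per-shard sequence
        # number to see how keys are distributed across shards.
        for r in streams.get_records(ShardIterator=iterator)["Records"]:
            print("  ", r["dynamodb"]["Keys"], r["dynamodb"]["SequenceNumber"])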

Not having control over which stream shard a mutation is mapped to leaves the developer no way to control the parallelization of stream processing. Approach #1 above could update the same ES document out of order. Approach #2 can solve this by processing serially, but then replication cannot be parallelized or scaled (even across data partitions), given that there is no contract on the shard placement strategy.
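
One possible mitigation for the out-of-order updates in approach #1: Elasticsearch supports external versioning, which rejects any write whose version is not greater than the stored one. This assumes each item carries a monotonically increasing version attribute maintained by the application (DynamoDB sequence numbers can exceed ES's 64-bit version range, so they cannot safely be used directly). A minimal sketch, with placeholder index name and endpoint:

    from elasticsearch import Elasticsearch
    from elasticsearch.exceptions import ConflictError

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    def apply_mutation(doc_id, doc, version):
        # With version_type="external", ES refuses writes whose version is
        # <= the stored version, so a late-arriving stale mutation becomes
        # a harmless no-op instead of clobbering newer data.
        try:
            es.index(
                index="entities",  # hypothetical index name
                id=doc_id,
                body=doc,
                version=version,
                version_type="external",
            )
        except ConflictError:
            pass  # an older mutation arrived after a newer one; drop it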

Any thoughts on how to scale this while also making the replication resilient to failures? Or could someone shed light on how mutations are placed into DynamoDB stream shards?

1
This is a few months old; did you ever find an answer to this question? - Jacob

1 Answer

0 votes

Someone from AWS (or with more experience) should clarify, but my understanding is that each DynamoDB partition initially maps to one shard. When that shard fills up, child shards are created. Each shard and its children are processed sequentially by a single KCL worker.

Since an item's partition key determines its destination shard, mutations of the same item will land in the same shard (or its children). A shard and its children are guaranteed to be processed in order by a single KCL worker. Each KCL worker also maps to a single Lambda instance, so different mutations of the same item will never be processed in parallel.
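
If that understanding is correct, one practical consequence is that all records in a single Lambda batch come from one shard, in order, so the batch can be collapsed to the newest image per key before writing to ES. A rough sketch (index name and endpoint are placeholders; delete handling for REMOVE events is omitted for brevity):

    import json
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    def handler(event, context):
        # The last record seen for a given key within the batch is the
        # newest mutation of that item, since in-shard order is preserved.
        latest = {}
        for record in event["Records"]:
            key = json.dumps(record["dynamodb"]["Keys"], sort_keys=True)
            latest[key] = record

        # One ES write per item instead of one per mutation.
        actions = [
            {
                "_index": "entities",  # hypothetical index name
                "_id": key,
                "_source": record["dynamodb"].get("NewImage", {}),
            }
            for key, record in latest.items()
        ]
        helpers.bulk(es, actions)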

Although DynamoDB Streams differs from Kinesis Streams, reading the Kinesis documentation helped place some pieces of the puzzle. There is also an interesting blog post with very useful information.