I am looking into replicating DynamoDB into Elasticsearch (ES). We evaluated the Logstash input plugin for this purpose, but found the following drawbacks:
- Logstash in pull mode has no HA/failover features, so it becomes a single point of failure (SPOF) for replication.
- Since we do not want to do application-level joins on ES indexes, we want to merge multiple tables into one ES document. The plugin does not support this use case.
Hence, we are evaluating the following two approaches:
- Lambdas read the DynamoDB stream and push the records to ES via SQS
- Our own DynamoDB stream processor, replacing the AWS Lambdas
Now to the actual problem: ordering matters when replicating data from DynamoDB Streams to ES, since there can be multiple mutations for the same entity. The Streams/Lambda documentation says that records in different stream shards are processed by Lambdas concurrently.
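To make the hazard concrete, here is a minimal sketch. A plain dict stands in for the ES index, and `version` is a hypothetical per-entity counter (an assumption: something the application would have to maintain or derive, it is not part of our current setup). It shows that a blind last-writer-wins apply keeps stale data when two mutations for the same entity arrive reversed, while a version check drops the stale one.

```python
def apply_blind(index, doc_id, version, body):
    """Last writer wins: whatever arrives last overwrites the document."""
    index[doc_id] = {"version": version, "body": body}


def apply_if_newer(index, doc_id, version, body):
    """Apply a mutation only if it is newer than the stored document."""
    current = index.get(doc_id)
    if current is not None and current["version"] >= version:
        return False  # stale mutation, drop it
    index[doc_id] = {"version": version, "body": body}
    return True


# Two mutations for the same entity, delivered in the wrong order.
# `version` is a hypothetical monotonically increasing counter per entity.
arrivals = [("user-1", 2, {"name": "new"}), ("user-1", 1, {"name": "old"})]

blind, guarded = {}, {}
for doc_id, version, body in arrivals:
    apply_blind(blind, doc_id, version, body)
    apply_if_newer(guarded, doc_id, version, body)

print(blind["user-1"]["body"])    # {'name': 'old'}  (the stale write won)
print(guarded["user-1"]["body"])  # {'name': 'new'}
```

As far as I can tell, Elasticsearch's external versioning (`version_type=external`) could play the role of this check, but it needs a number that increases with every mutation of a document, and I am not sure what to use for that when several tables are merged into one document.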
AWS does not document (or at least I have not been able to locate) details of how DynamoDB mutations are mapped to stream shards - whether there is any correlation to hash keys of tables, or if it is some kind of bin-packing algorithm.
Without control over (or at least knowledge of) which stream shard a mutation lands in, a developer cannot control how stream processing is parallelized. Approach #1 above could update the same ES document out of order. Approach #2 can solve this by processing serially, but then replication cannot be parallelized or scaled (not even across data partitions), given that there is no contract on the shard placement strategy.
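For reference, this is roughly the parallelization I would like to do in approach #2 if I could rely on per-key ordering. It is only a sketch: the record shape (`key`, `seq`) and the function names are hypothetical, and it assumes the reader already receives all mutations for a given key in order, which is exactly the guarantee I cannot find documented.

```python
import hashlib
from collections import defaultdict


def worker_for(entity_key: str, num_workers: int) -> int:
    """Map an entity key to a worker with a stable hash."""
    digest = hashlib.sha256(entity_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def route(records, num_workers):
    """Split records into per-worker queues.

    Every mutation for a given key goes to the same queue, and the relative
    order of the input is kept inside each queue, so workers can run in
    parallel without reordering mutations of the same entity.
    """
    queues = defaultdict(list)
    for record in records:
        queues[worker_for(record["key"], num_workers)].append(record)
    return queues


# Hypothetical records: "key" is the entity's hash key, "seq" its position
# in the order in which the reader received the mutations.
records = [
    {"key": "user-1", "seq": 1},
    {"key": "user-2", "seq": 2},
    {"key": "user-1", "seq": 3},
    {"key": "user-3", "seq": 4},
    {"key": "user-2", "seq": 5},
]

for worker, queue in sorted(route(records, num_workers=4).items()):
    print(worker, [(r["key"], r["seq"]) for r in queue])
```

This only helps if the input order per key is trustworthy in the first place; if mutations for one key can be spread over shards that are read concurrently, the routing step preserves an order that is already wrong.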
Any thoughts on how to scale the replication and also make it resilient to failures? Or could someone shed light on how mutations are placed into DynamoDB stream shards?