How to process DynamoDB Stream in a Spark streaming application

Question

I would like to consume a DynamoDB Stream from a Spark Streaming application.

Spark streaming uses KCL to read from Kinesis. There is a lib to make KCL able to read from a DynamoDB Stream: dynamodb-streams-kinesis-adapter.

But is it possible to plug this lib into spark? Anyone done this?

I'm using Spark 2.1.0.

My backup plan is to have another app reading from DynamoDB stream into a Kinesis stream.

Thanks

I've been able to consume from a DynamoDB stream by tweaking: KinesisUtils, KinesisInputDStream and KinesisReceiver. The real change being in the KinesisReceiver where I use a com.amazonaws.services.dynamodbv2.streamsadapter.StreamsWorker. — Raphaël Douyère

ravi72munde ravi72munde · Accepted Answer · 2018-12-02T08:12:24

The way to do this it to implement the KinesisInputDStream to use the worker provided by dynamodb-streams-kinesis-adapter The official guidelines suggest something like this:

final Worker worker = StreamsWorkerFactory .createDynamoDbStreamsWorker( recordProcessorFactory, workerConfig, adapterClient, amazonDynamoDB, amazonCloudWatchClient);

From the Spark's perspective, it is implemented under the kinesis-asl module in KinesisInputDStream.scala

I have tried this for Spark 2.4.0. Here is my repo. It needs little refining but gets the work done

https://github.com/ravi72munde/spark-dynamo-stream-asl

After modifying the KinesisInputDStream, we can use it as shown below. val stream = KinesisInputDStream.builder .streamingContext(ssc) .streamName("sample-tablename-2") .regionName("us-east-1") .initialPosition(new Latest()) .checkpointAppName("sample-app") .checkpointInterval(Milliseconds(100)) .storageLevel(StorageLevel.MEMORY_AND_DISK_2) .build()

How to process DynamoDB Stream in a Spark streaming application

1 Answers