0
votes

I'm storing some events into DynamoDB. I have to sync (i.e. copy incrementally) the data with Redshift. Ultimately, I want to be able to analyze the data through AWS Quicksight.

I've come across multiple solutions, but they are either one-time (a single COPY command) or real-time (a streaming data pipeline using Kinesis Firehose).

The real-time solution seems superior to an hourly sync, but I'm worried about its performance and complexity. I was wondering whether there's an easier way to batch the updates on an hourly basis.


1 Answer

2
votes

What you are looking for is DynamoDB Streams (official docs). These can flow into Kinesis Firehose, as you have correctly pointed out.
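
As a quick illustration, turning on a stream for an existing table is a single API call. Here is a minimal sketch with boto3; the table name and view type are placeholders for your setup, not values from the question:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Enable a stream on an existing table (assumed table name "events").
# NEW_IMAGE emits the full item as it looks after each modification.
dynamodb.update_table(
    TableName="events",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_IMAGE",
    },
)
```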

This is the optimal approach and provides the best balance between cost, operational overhead, and functionality. Allow me to explain how:

  1. DynamoDB Streams: A stream record is emitted whenever an item in the table is modified. This means that, unlike a process that scans the table on a periodic schedule and consumes read capacity even when there is no update, you are notified only when new data arrives.
  2. Kinesis Firehose: You can configure Firehose to batch data either by size or by time. This means that if you have a steady inflow, you can set the delivery stream to buffer records over, say, a 2-minute interval and then issue just one COPY command to Redshift; the same goes for the amount of data in the stream buffer (see the configuration sketch after this list). Read more about it here.
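
To make the buffering knobs concrete, here is a rough sketch of creating a Firehose delivery stream that buffers for 2 minutes (or 5 MB, whichever comes first) before issuing a COPY into Redshift. Every name, ARN, and credential below is a placeholder for your own resources:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-to-redshift",  # hypothetical name
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "ClusterJDBCURL": (
            "jdbc:redshift://example-cluster.abc123.us-east-1"
            ".redshift.amazonaws.com:5439/dev"
        ),
        # Firehose issues this COPY against Redshift for each batch.
        "CopyCommand": {
            "DataTableName": "events",
            "CopyOptions": "json 'auto' gzip",
        },
        "Username": "firehose_user",
        "Password": "change-me",  # placeholder credential
        # Firehose stages each batch in S3 before running COPY.
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::example-staging-bucket",
            # Flush whichever comes first: 5 MB of data or 120 seconds.
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 120},
            "CompressionFormat": "GZIP",
        },
    },
)
```

The buffering interval can be raised (up to Firehose's maximum) if you prefer fewer, larger COPY operations.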

The ideal way to load data into Redshift is via the COPY command, and Kinesis Firehose does exactly that. You can also configure it to automatically keep a backup of the data in S3.
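
Depending on how you wire the DynamoDB stream to the delivery stream, a small Lambda function subscribed to the stream is a common glue piece. A minimal sketch, assuming the hypothetical delivery stream above and that you only care about inserts and updates (the serialization is an assumption; match it to your COPY options):

```python
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    """Forward DynamoDB Stream records to the Firehose delivery stream."""
    records = []
    for record in event["Records"]:
        # Only forward inserts and updates; deletes are skipped here (assumption).
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        # NewImage is in DynamoDB's attribute-value format; in practice you
        # would flatten it to plain JSON before loading (omitted for brevity).
        new_image = record["dynamodb"]["NewImage"]
        records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})

    if records:
        # PutRecordBatch accepts up to 500 records per call.
        firehose.put_record_batch(
            DeliveryStreamName="events-to-redshift",  # hypothetical name
            Records=records,
        )
```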

Remember that a reactive, push-based system is almost always more performant and less costly than a proactive, poll-based one. You save the compute capacity needed to run a cron process and to continuously scan for updates.