1 vote

I need to aggregate data coming from DynamoDB into AWS Redshift, and it needs to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, Kinesis Data Firehose, and finally Redshift.

What would the process be for updated data? Everything I find is fine-tuned just for ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?

These are my current options:

  1. Trigger an "UPDATE" command directly from Lambda to Redshift (blocking).
  2. Aggregate all update/delete records and process them on an hourly basis "somehow".

Any experience with this? Maybe Redshift is not the best solution? I need to extract aggregated data for reporting/dashboarding on 2 TB of data.


2 Answers

0 votes

The Redshift COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to this process: data type and table naming differences can trip you up. It also isn't a great option for incremental updates, but it can be done if the amount of data is small and you can design the updating SQL.
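As a rough illustration (the table names, IAM role ARN, and read ratio below are placeholders, not taken from the question), a DynamoDB-sourced COPY looks something like this:

    -- Hypothetical names throughout; READRATIO caps how much of the table's
    -- provisioned read capacity the COPY is allowed to consume.
    COPY my_redshift_table
    FROM 'dynamodb://MyDynamoDBTable'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    READRATIO 50;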

Another route to look at is DynamoDB Streams. This routes data updates through Kinesis, which can be used to update Redshift at a reasonable rate and helps keep the two databases in sync. It will likely make the data available to Redshift as quickly as possible.
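If you take the Firehose route, the delivery stream stages records in S3 and then issues a COPY into Redshift on your behalf. Roughly (the bucket, table, role, and format here are all assumptions, not your actual configuration), the load amounts to something like:

    -- Sketch of the kind of COPY a Firehose delivery stream runs against its
    -- intermediate S3 staging area; every name and the ARN are placeholders.
    COPY my_redshift_table
    FROM 's3://my-firehose-bucket/redshift-staging/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    JSON 'auto';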

Remember that you are not going to get Redshift to match DynamoDB on a moment-by-moment basis. Is this what you mean by "in sync"? These are very different databases, with very different use cases and architectures to support those use cases. Redshift works in big chunks of data that change more slowly than is typical in DynamoDB, so Redshift will be updated in "chunks" at a more infrequent rate than DynamoDB. I've built systems that bring this down to 5-minute intervals, but 10-15 minute update intervals are where most end up when trying to keep a warehouse in sync.

The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older" data stored in Redshift. This is a more complicated solution and will likely mean changes to your data model to support it, but it is doable. So only go here if you really need to query very recent data right alongside older and bigger data.
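For illustration only (the schema names, table names, and the one-hour cutoff are assumptions), such a query might union recent rows reached through a federated external schema with history already loaded in Redshift:

    -- federated_pg.orders: live/recent rows reached via a federated schema.
    -- analytics.orders: older rows already stored in Redshift.
    SELECT order_id, amount, updated_at
    FROM federated_pg.orders
    WHERE updated_at >= DATEADD(hour, -1, GETDATE())
    UNION ALL
    SELECT order_id, amount, updated_at
    FROM analytics.orders
    WHERE updated_at < DATEADD(hour, -1, GETDATE());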

0 votes

The best-suited answer is to use a staging table with an UPSERT operation (or Redshift's interpretation of it).

I found this answer valid for my use case, where I needed to:

  • Keep Redshift as up to date as possible without causing blocking.
  • Work with complex DynamoDB schemas, which can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.

This is the architecture:

[Diagram: upsert architecture on Redshift using Kinesis]

So we constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we use a staging one. Once the batch is loaded into staging, we look for duplicates between the two tables; those duplicates in the final table are DELETEd before the INSERT is performed.

After trying this, I've found that all DELETE operations for the same batch perform better if enclosed within a single transaction. Also, a VACUUM operation is needed afterwards to re-sort the table and reclaim space after the new load.
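A minimal sketch of that sequence, using hypothetical table names (staging_events, final_events), a hypothetical key (event_id), and placeholder S3/IAM values:

    -- 1. Load the new batch from the stream's S3 staging area into the staging table.
    COPY staging_events
    FROM 's3://my-bucket/stream-batches/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    JSON 'auto';

    -- 2. Remove rows in the final table that the batch replaces, then insert
    --    the batch, all within a single transaction.
    BEGIN;
    DELETE FROM final_events
    USING staging_events
    WHERE final_events.event_id = staging_events.event_id;
    INSERT INTO final_events
    SELECT * FROM staging_events;
    END;

    -- 3. Clear the staging table, then re-sort and reclaim space after the churn
    --    (VACUUM cannot run inside a transaction block).
    TRUNCATE staging_events;
    VACUUM final_events;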

For further detail on the UPSERT operation, I've found this source very useful.