
I am working on copying a huge amount of data (100 million plus entries) from a DynamoDB table to Redshift, and I need to filter the data based on some criteria. I have evaluated a couple of ways of achieving this task:

  1. Using the Redshift COPY command: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html (http://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-dynamodb.html). Con of this approach: the COPY command consumes read throughput of the source DynamoDB table, so it is not recommended for production DDB tables. (The read ratio regulates the percentage of the source table's provisioned throughput that is consumed; it is recommended to set it to a value below the table's average unused provisioned throughput.) A minimal sketch of such a COPY appears after this list.

  2. Using AWS Data Pipeline with only a RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html) to copy data directly from DynamoDB to Redshift, and then running a query on Redshift to filter on the criteria.
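
For reference, here is a minimal, hypothetical sketch of option 1, issuing the COPY with a capped READRATIO from a Python client. The cluster endpoint, credentials, table names and IAM role ARN are placeholders, not values from my actual setup:

    # Option 1 sketch: COPY straight from DynamoDB, capping how much of the source
    # table's provisioned read throughput the load may consume via READRATIO.
    # READRATIO 25 tells Redshift to use at most ~25% of the provisioned read capacity.
    # Endpoint, credentials, table names and the IAM role ARN are placeholders.
    import psycopg2

    COPY_FROM_DDB = """
        COPY events_staging
        FROM 'dynamodb://source-ddb-table'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        READRATIO 25;
    """

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="********",
    )
    with conn, conn.cursor() as cur:
        cur.execute(COPY_FROM_DDB)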

I couldn't find any information on whether the throughput of the source DynamoDB table would also be affected when using RedshiftCopyActivity. Could someone please provide any information on this?

Also, would copying the data from DynamoDB to S3 first, and then from S3 to Redshift, be more beneficial than copying directly from DynamoDB to Redshift? A rough sketch of what the S3-staged load would look like is below.
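
If the staged route were used, the second hop would be a plain COPY from S3. This is only a sketch, assuming the DynamoDB data has already been exported to S3 as newline-delimited JSON; the bucket path, table name, credentials and IAM role ARN are placeholders, not values from my setup:

    # Staged-route sketch: COPY from S3 into Redshift after the DynamoDB export.
    # Assumes the export already landed in S3 as newline-delimited JSON;
    # bucket path, table name, credentials and IAM role ARN are placeholders.
    import psycopg2

    COPY_FROM_S3 = """
        COPY events_staging
        FROM 's3://my-export-bucket/ddb-export/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """

    with psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="awsuser", password="********",
    ) as conn, conn.cursor() as cur:
        cur.execute(COPY_FROM_S3)

My understanding is that a COPY from S3 consumes no DynamoDB read capacity at all; only the export step touches the source table, and that step can be throttled independently of the Redshift load.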


1 Answer


Try to minimize touching DynamoDB. Generally, I'd say it's a bad idea to use it for anything other than a key-value store. Any filtering logic should happen in Redshift once the data has been loaded, as in the sketch below.
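
As a rough illustration of that pattern (table names, the WHERE clause and connection details below are hypothetical examples, not something from your setup):

    # Illustration of "do the logic in Redshift": load everything into a staging
    # table first, then apply the filter criteria with ordinary SQL.
    # Table names, the WHERE clause and connection details are hypothetical.
    import psycopg2

    FILTER_IN_REDSHIFT = """
        INSERT INTO events_filtered
        SELECT *
        FROM events_staging
        WHERE event_type = 'purchase'
          AND event_ts >= '2017-01-01';
    """

    with psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="awsuser", password="********",
    ) as conn, conn.cursor() as cur:
        cur.execute(FILTER_IN_REDSHIFT)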