
Can someone please help me with this calculation? Is this the right way to calculate the pricing?

We have a stream that produces an average of 8 million records/hour (about 2,500 records/second), and we want to insert these records into DynamoDB using AWS Lambda.

Each record is at most 450 bytes in size. My understanding is that DynamoDB consumes 1 write capacity unit per item of up to 1 KB, so for 2,500 records/second we need 2,500 write capacity units.

Data gets inserted into DynamoDB in real time. For analysis purposes, we need the same data in Redshift as well, so every 15 minutes we need to sync the DynamoDB table to a Redshift table. The business wants to do analytics on top of the Redshift cluster only.

To copy the data from DynamoDB to Redshift using the Redshift COPY command within 15 minutes, I need approximately 10,000 read capacity units.

Doing this will cost approximately USD 2,400/month.
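
As a rough sanity check of that figure, here is the back-of-the-envelope arithmetic. The per-unit rates below are assumptions (roughly us-east-1 provisioned-capacity pricing) and change over time, so verify them against the current DynamoDB pricing page:

```python
# Back-of-the-envelope DynamoDB provisioned-capacity cost estimate.
# The hourly rates are assumptions; check the current DynamoDB pricing page.
WCU_RATE = 0.00065      # USD per write capacity unit-hour (assumed)
RCU_RATE = 0.00013      # USD per read capacity unit-hour (assumed)
HOURS_PER_MONTH = 730

wcu = 2_500             # ~2,500 writes/sec, each item < 1 KB => 1 WCU per write
rcu = 10_000            # provisioned for the periodic export reads

write_cost = wcu * WCU_RATE * HOURS_PER_MONTH   # ~1,186 USD/month
read_cost = rcu * RCU_RATE * HOURS_PER_MONTH    # ~949 USD/month
print(f"writes ~${write_cost:,.0f}, reads ~${read_cost:,.0f}, "
      f"total ~${write_cost + read_cost:,.0f} per month")
```

That lands in the same ballpark as the USD 2,400/month estimate above (storage, Lambda, and S3 costs not included).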


More info

Existing architecture

S3 => Lambda => Redshift: This works fine up to Lambda, but when it comes to Redshift, the load jobs get queued and other jobs get delayed. We checked with AWS Technical Support, and they clearly stated that Redshift is not a good choice for real-time or near-real-time ingestion. Instead of Lambda we tried an EMR cluster, created a manifest file (combining multiple files) and did a Redshift COPY, but the result was still not satisfactory, and the business does not want to increase the cluster size right now.

Proposed architecture

S3 => Lambda => DynamoDB (as soon as the file lands) => every 15 minutes => query data from DynamoDB and save to S3 => Redshift
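
For context, the first hop (S3 => Lambda => DynamoDB) would look roughly like the sketch below. The table name, bucket layout, and record format (one JSON object per line) are assumptions for illustration only:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")        # hypothetical target table
s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by an S3 ObjectCreated notification; writes the file's records to DynamoDB."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # batch_writer() groups PutItem requests into BatchWriteItem calls
        # (25 items per batch) and retries unprocessed items automatically.
        with table.batch_writer() as batch:
            for line in body.splitlines():   # assumes one JSON record per line
                if line.strip():
                    batch.put_item(Item=json.loads(line))
```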

What the business is asking is how much it will cost if, every 15 minutes, we query around 4 million records from DynamoDB, save them to S3, and do a Redshift COPY from that S3 data.

If I provision 10,000 read capacity units, can the read from DynamoDB (4.5 million records) complete in 7 minutes?

10,000 RCU × 60 seconds = 600,000 records per minute

600,000 × 7 minutes = 4.2 million records
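
A quick check of that arithmetic, assuming items under 4 KB read with strongly consistent reads (1 RCU per record per second; eventually consistent reads would consume half the capacity):

```python
# Read-throughput sanity check for the export window.
provisioned_rcu = 10_000
records_to_read = 4_500_000   # items, each < 4 KB => 1 RCU per strongly consistent read

records_per_minute = provisioned_rcu * 60                 # 600,000 records/minute
minutes_needed = records_to_read / records_per_minute     # 7.5 minutes
print(f"{records_per_minute:,} records/minute -> {minutes_needed:.1f} minutes")
```

So 10,000 RCU covers about 4.2 million records in 7 minutes; the full 4.5 million would take closer to 7.5 minutes.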

My calculation is:

  • 7 minutes to read from DynamoDB
  • 3 minutes to write to S3
  • 3 minutes to do the Redshift COPY

In short, the business needs the data to be in DynamoDB as soon as a file lands in S3. They can wait 15 minutes for it to be reflected in Redshift.

Any other suggestions on this?

What do you mean by "in 15 minutes"? Is the data coming in continuously and you want to write it to DynamoDB, or is this a one-off exercise where you want to load a certain number of records in 15 minutes? Your 2,500 figure is about right, but what has it got to do with 15 minutes? – John Rotenstein
Data gets inserted into DynamoDB in real time. For analysis purposes, we need the same data in Redshift as well, so every 15 minutes we need to sync the DynamoDB table to a Redshift table. The business wants to do analytics on top of the Redshift cluster only. – Sachin Sukumaran
Why not use AWS Lambda to dump the data to S3, create an external table in Redshift, and query it using Redshift Spectrum? – sardar
@sardar Yes, that's much better! Copying data from DynamoDB to Redshift is quite "expensive" in terms of DynamoDB capacity. It would be much easier to use Redshift Spectrum to read it directly from S3. – John Rotenstein
@SachinSukumaran What are you using to provision the "stream"? Are you using Amazon Kinesis? If so, it makes it very easy to move the data to Redshift. Can you provide a high-level description of the total system you are architecting, such as how you will use DynamoDB, Redshift, etc? This will help us provide a more suitable answer for you. Feel free to edit your question to add more details. Please concentrate on what you're trying to achieve rather than how you want to achieve it. – John Rotenstein

1 Answer


You appear to have two requirements:

  • Load data from Amazon S3 into DynamoDB
  • Load data from Amazon S3 into Amazon Redshift

S3 to DynamoDB

It appears that you have this working fine, using Amazon S3 to trigger an AWS Lambda function that then loads the data into DynamoDB.

S3 to Redshift

There is no reason to copy data from DynamoDB to Redshift if it is exactly the same data that was originally loaded from S3. Simply use the same files to load the data into Redshift, but do it as a batch at regular intervals.

The best way to load data into Amazon Redshift is in bulk via the COPY command. Therefore, I would recommend:

  • Use a Lambda function (potentially the same one as used for DynamoDB) to collate a list of all files arriving in Amazon S3. You could store this list in a DynamoDB table (or somewhere else of your own choosing).
  • Use an Amazon CloudWatch Events schedule to trigger an AWS Lambda function at regular intervals (every 15 minutes; a rough sketch follows this list) which will:
    • Create a manifest file of all files received since the last load
    • Run a COPY command in Redshift, pointing to the manifest file
    • Mark those files in DynamoDB as having been processed (so that they will not be included in future executions)
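
A minimal sketch of that scheduled function is below. Every identifier in it (the tracking table incoming_files, the bucket my-ingest-bucket, the target table my_schema.events, the IAM role, and the connection details) is a hypothetical placeholder, and Scan pagination, error handling, and credential management (e.g. Secrets Manager) are omitted for brevity:

```python
import json
import boto3
import psycopg2                         # must be packaged with the Lambda deployment
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
tracking = dynamodb.Table("incoming_files")     # hypothetical file-tracking table

def handler(event, context):
    # 1. Find the files that have not yet been loaded into Redshift.
    #    (A real implementation should paginate the Scan.)
    resp = tracking.scan(FilterExpression=Attr("processed").eq(False))
    pending = [item["s3_key"] for item in resp["Items"]]
    if not pending:
        return {"loaded": 0}

    # 2. Write a COPY manifest listing those files.
    manifest = {
        "entries": [
            {"url": f"s3://my-ingest-bucket/{key}", "mandatory": True}
            for key in pending
        ]
    }
    s3.put_object(
        Bucket="my-ingest-bucket",
        Key="manifests/latest.manifest",
        Body=json.dumps(manifest),
    )

    # 3. Run the COPY in Redshift, pointing at the manifest.
    #    Add format options (CSV, JSON 'auto', GZIP, ...) to match the source files.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="loader",
        password="...",                 # fetch from Secrets Manager in practice
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY my_schema.events
            FROM 's3://my-ingest-bucket/manifests/latest.manifest'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            MANIFEST;
        """)
    conn.close()

    # 4. Mark the files as processed so the next run skips them.
    for key in pending:
        tracking.update_item(
            Key={"s3_key": key},
            UpdateExpression="SET processed = :t",
            ExpressionAttributeValues={":t": True},
        )
    return {"loaded": len(pending)}
```

Setting "mandatory": true in the manifest makes the COPY fail if any listed file is missing, which is usually what you want for a scheduled batch load. The Redshift Spectrum approach mentioned in the comments would avoid the COPY (and this function) entirely by querying the files in place on S3.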

Benefits of this method:

  • Separation of the two processes, which makes them easier to maintain
  • No need to read data from DynamoDB, which will reduce the Read Capacity Unit cost
  • You can vary the frequency of the load into Redshift as desired