Can some please help me in this calculation. Is this the way we calculate for pricing ?
We have a stream which creates an average of 8 million records/ hour (2500 records/second) and we want to insert into DynamoDB using aws Lambda.
Each records have max of 450 Bytes in size. What I understand is for less than 1 KB , Dynamodb takes 1 Write capacity unit. So for 2500 records / second we need 2500 Write capacity unit.
Data get inserted to DynamoDB in real time. For analysis purpose , we need the same data to be in Redshift as well. So every 15 minutes we need to sync Dynamodb table to Redshift table. Business want to do Analytics on top Redshift cluster only.
To do a copy data from DynamoDB to redshift using redshift COPY command in 15 minutes I need approximate 10000 Read capacity unit.
For doing this it will cost approx USD 2400/month
More info
Existing architecutre
S3 => Lambda => Redshift : This works fine till lambda , when it comes to redshift , jobs are getting into Queue, and other Jobs are getting delayed. So we checked with Amazon Tech support : They clearly mention , for real time or near real time Injection , Redshift is not a good choose. Instead of Lambda we use EMR clusster , created manifest file ( adding multiple files) and did a redshift copy - Still the result was satisfied. Business dint want to increase the cluster right now.
Proposed architecture
S3 => Lambda => DynamoDB (as soon as file Landed )=> Every 15 miutes => Query data from Dynamo DB and Save to S3 => Redshift
The ask from business is how much cost is Every 15 minutes we are querying Dynamodb data with record count of 4 million . Save it to S3 Do a Redshfit Copy with that S3.
If I give a read unit as 10,000 - can the read from Dynamodb(4.5 million) completes in 7 minutes
10000 * 60(seconds) = 6,00,000 records
6,00,000 * 7 minutes = 4.2 million
My calculation is
7 minutes to read from Dynamo DB
3 minutes to write to S3
3 minutes to do redshift COPY
In short Business needs the data to be in Dynamodb as soon as file landed in S3. They can wait 15 minutes to reflect in Redshift.
Any other suggestion on this