I have a producer application that writes to a Kinesis stream at a rate of 600 records per second. I have written an Apache Flink application that reads, processes, and aggregates this streaming data and writes the aggregated output to AWS Redshift.

The average size of each record is 2 KB. The application will run 24/7.

I would like to know how my AWS EMR cluster should be configured. How many nodes do I need? Which EC2 instance type (R3 or C3) should I use?

Apart from the performance aspect, cost is also important for us.

1 Answer

Whether to go with r3 or c3 depends on how much memory and CPU your application uses.

I assume that you are using windowing or some other stateful operator to perform the aggregation. A stateful operator keeps its state in the configured state backend: https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/state_backends.html#state-backends

So you can first check whether the state fits in memory (if you intend to use FsStateBackend) by trying out your application on c3 instances. You can check the memory utilization with JVisualVM. Also check the CPU utilization during the same run.
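
Before running that test, a rough estimate can tell you what to expect. The sketch below uses the 600 records/sec and 2 KB figures from the question. The window lengths and the `overhead_factor` parameter are assumptions of mine, and the formula covers only the worst case where the window buffers every raw record until it fires. With incremental aggregation the state is usually far smaller, so treat the result as an upper bound rather than a measurement.

```python
# Back-of-the-envelope sizing from the figures in the question.
RECORDS_PER_SEC = 600   # from the question
RECORD_SIZE_KB = 2      # average record size, from the question


def ingest_kb_per_sec(records_per_sec, record_size_kb):
    """Raw ingest rate in KB per second."""
    return records_per_sec * record_size_kb


def buffered_window_state_mb(records_per_sec, record_size_kb,
                             window_seconds, overhead_factor=1.0):
    """Upper bound on window state in MB if every raw record is buffered.

    overhead_factor is a hypothetical multiplier for JVM object overhead;
    measure your own job to find a realistic value.
    """
    total_kb = records_per_sec * record_size_kb * window_seconds
    return total_kb * overhead_factor / 1024


if __name__ == "__main__":
    rate = ingest_kb_per_sec(RECORDS_PER_SEC, RECORD_SIZE_KB)
    print(f"ingest rate: {rate} KB/s")
    # Hypothetical window lengths: one minute and one hour.
    for seconds in (60, 3600):
        mb = buffered_window_state_mb(RECORDS_PER_SEC, RECORD_SIZE_KB, seconds)
        print(f"{seconds:>5} s window: about {mb:.1f} MB of buffered records")
```

If the estimate is well below the heap you can give the TaskManagers on a c3 node, the c3 trial is a sensible starting point. If it is close to or above it, start with r3.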

With r3 instances you get more memory for the same number of vCPUs that c3 provides. For example, a c3.4xlarge instance provides 16 vCPUs with 30 GB of memory per node, whereas an r3.4xlarge provides 16 vCPUs with 122 GB of memory per node.
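
To make the difference concrete, here is a small sketch that computes memory per vCPU for the two instance types mentioned above. The specs are the ones quoted in this answer; the dictionary layout and function name are just for illustration.

```python
# Instance specs as quoted in the answer above.
INSTANCES = {
    "c3.4xlarge": {"vcpu": 16, "memory_gb": 30},
    "r3.4xlarge": {"vcpu": 16, "memory_gb": 122},
}


def memory_per_vcpu(spec):
    """GB of memory available per vCPU for one instance spec."""
    return spec["memory_gb"] / spec["vcpu"]


if __name__ == "__main__":
    for name, spec in INSTANCES.items():
        print(f"{name}: {memory_per_vcpu(spec):.3f} GB per vCPU")
```

So r3 gives roughly four times the memory per vCPU, which matters only if your state is what limits you.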

So which instance type you should use depends on your application.

For a price comparison, you can refer to http://www.ec2instances.info/
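
Since the job runs around the clock, it may help to turn hourly prices into a monthly figure before comparing. This is only a sketch: the function name is hypothetical, the prices in the example are placeholders and not real AWS prices, and I am assuming about 730 hours per month. Note that EMR bills its own per-instance-hour charge on top of the EC2 price, so both are taken as inputs.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month for a 24/7 job


def monthly_cluster_cost(nodes, ec2_hourly_usd, emr_hourly_usd):
    """Approximate monthly cost of an always-on cluster.

    Look up both hourly prices yourself for your region and instance type.
    """
    return nodes * (ec2_hourly_usd + emr_hourly_usd) * HOURS_PER_MONTH


if __name__ == "__main__":
    # Placeholder prices for illustration only, NOT real AWS prices.
    cost = monthly_cluster_cost(nodes=3, ec2_hourly_usd=1.00, emr_hourly_usd=0.25)
    print(f"about ${cost:,.2f} per month")
```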