I have set up a sample Kafka cluster on AWS and am trying to identify the maximum throughput possible with the given configuration. I am currently following the post below for this analysis.

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

I would appreciate it if you could clarify the following issues.

I observed a throughput of 40 MB/s for messages of 512 bytes (single producer, single consumer) with the given hardware. Assume I need to achieve a throughput of 80 MB/s.

As I understand it, one way to do this is to increase the number of partitions per topic and the number of threads in the producer and consumer. (Assuming I do not change the default values for batch size, compression, etc.)
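To experiment with partition counts, the perf tools shipped with Kafka (the same ones used in the LinkedIn post) are the usual starting point. A sketch, assuming a topic named `perf-test` and a broker at `localhost:9092` (both placeholders for your setup; flag names may vary slightly across Kafka versions):

```shell
# Create a topic with more partitions (8 here is an arbitrary example).
bin/kafka-topics.sh --create --topic perf-test \
  --partitions 8 --replication-factor 1 \
  --bootstrap-server localhost:9092

# Drive load with 512-byte records, unthrottled (--throughput -1),
# and read the reported MB/s from the tool's output.
bin/kafka-producer-perf-test.sh --topic perf-test \
  --num-records 1000000 --record-size 512 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092
```

Running several producer processes in parallel is a simple way to emulate more producer threads against the same topic.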

  1. How do I find the maximum throughput possible with the given hardware, i.e. the point beyond which I must upgrade the hardware to further improve throughput?

(In other words, how do I make the decision: "With X GB of RAM and Y GB of disk space, this is the maximum throughput I can achieve. If I need to improve the throughput further, I have to upgrade the RAM to XX GB and the disk space to YY GB.")

  2. Should we scale the cluster vertically or horizontally? What is the recommended approach?

Thank you.


1 Answer

  1. If we define throughput as the volume of data transmitted over the network per second, the maximum throughput cannot exceed (number of machines) × (per-machine network bandwidth). For a single machine whose NIC is rated at 1 Gbps, the maximum throughput on that machine cannot exceed 1 Gbps. In your case the observed throughput is 40 MB/s, i.e. 320 Mbps, which is well below 1 Gbps, so there is still room for improvement. However, if your target is far larger than 1 Gbps, you definitely need more machines.
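As a quick sanity check, the NIC-utilization headroom can be computed directly. A minimal sketch; the 1 Gbps NIC and the 40 MB/s figure are taken from the question:

```python
# Estimate what fraction of the NIC capacity a given throughput consumes.
def nic_utilization(throughput_mb_per_s: float, nic_gbps: float = 1.0) -> float:
    """Fraction of NIC bandwidth used (1 MB/s taken as 8 Mbps)."""
    used_mbps = throughput_mb_per_s * 8
    capacity_mbps = nic_gbps * 1000
    return used_mbps / capacity_mbps

# Figures from the question: 40 MB/s = 320 Mbps on a 1 Gbps NIC.
print(nic_utilization(40))  # 0.32
```

Doubling the target to 80 MB/s would still only use 64% of a single 1 Gbps NIC, which supports the answer's point that the network is not yet the limit here.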

  2. AFAIK, network bandwidth is the most likely system bottleneck. Unlike CPU and RAM, it is not easy to scale vertically, so horizontal scaling may be the better option.

You could do some math before scaling. Say the throughput target is "produce 2 billion records of 512 bytes each within 1 hour". That means the throughput has to reach 2,000,000,000 × 8 × 512 / 3600 / 1024 / 1024 ≈ 2170 Mbps. Assuming the usable bandwidth of a single machine is 700 Mbps (utilization above roughly 70% typically starts to cause packet loss), at least 4 machines should be planned for the producer application.
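The same arithmetic can be scripted so the record count, record size, and per-machine bandwidth are easy to vary. A sketch using the numbers above; 1024 × 1024 is used for Mbps to match the answer's figure:

```python
import math

def machines_needed(records: int, record_bytes: int, window_s: int,
                    usable_mbps_per_machine: float) -> int:
    """Machines required to push `records` records of `record_bytes` bytes
    within `window_s` seconds, given usable per-machine bandwidth in Mbps."""
    required_mbps = records * record_bytes * 8 / window_s / 1024 / 1024
    return math.ceil(required_mbps / usable_mbps_per_machine)

# 2 billion 512-byte records in 1 hour, 700 Mbps usable per machine:
required = 2_000_000_000 * 512 * 8 / 3600 / 1024 / 1024
print(round(required))                                  # 2170 (Mbps)
print(machines_needed(2_000_000_000, 512, 3600, 700))   # 4
```

The ceiling rounds 2170 / 700 ≈ 3.1 up to 4 machines, matching the answer's estimate.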