1
votes

I have configured a two-node, six-partition Kafka cluster with a replication factor of 2 on AWS. Each Kafka node runs on an m4.2xlarge EC2 instance backed by an EBS volume.

I understand that the rate of data flow from a Kafka producer to a Kafka broker is limited by the producer's network bandwidth.

Say the network bandwidth between the Kafka producer and the broker is 1 Gbps (approx. 125 MB/s), and the bandwidth between the broker and its storage (between the EC2 instance and the EBS volume) is also 1 Gbps.

I used the org.apache.kafka.tools.ProducerPerformance tool to profile the performance.

I observed that a single producer can write at around 90 MB/s to the broker when the message size is 100 bytes (hence the network is not saturated).
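To put these figures side by side, here is a rough sketch of the arithmetic. It assumes decimal units (1 MB = 10^6 bytes) and ignores protocol overhead; if the tool reports in MiB, the numbers shift by a few percent. The function names are made up for illustration.

```python
def link_utilization(throughput_mb_s, link_gbps):
    """Fraction of the link used, assuming decimal units and no protocol overhead."""
    link_mb_s = link_gbps * 1000 / 8  # 1 Gbps -> 125 MB/s
    return throughput_mb_s / link_mb_s


def messages_per_second(throughput_mb_s, message_bytes):
    """Message rate implied by a payload throughput and a fixed message size."""
    return throughput_mb_s * 1_000_000 / message_bytes


util = link_utilization(90, 1)         # observed 90 MB/s on a 1 Gbps link
rate = messages_per_second(90, 100)    # 100-byte messages

print(f"link utilization: {util:.0%}")
print(f"message rate: {rate:,.0f} msg/s")
```

So 90 MB/s is roughly three quarters of the nominal link capacity, at close to a million small messages per second.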

I also observed that the disk write rate to the EBS volume is around 120 MB/s.

Is this 90 MB/s limit due to a network bottleneck, or is it a limitation of Kafka? (Ignoring batch size, compression, etc. for simplicity.)

Could it be due to the bandwidth limit between the broker and the EBS volume?

I also observed that when two producers (on two separate machines) produce data, the throughput of one producer drops to around 60 MB/s.

What could be the reason for this? Why doesn't that value reach 90 MB/s? Could it be due to a network bottleneck between the broker and the EBS volume?

What confuses me is that in both cases (single producer and two producers) the disk write rate to EBS stays around 120 MB/s (close to its upper limit).

Thank you


2 Answers

1
votes

I ran into the same issue. As I understand it, in the first case one producer is sending data to two brokers (with nothing else on the network), so you get 90 MB/s in total and each broker receives approximately 45 MB/s. In the second case two producers are sending data to the same two brokers. From each producer's perspective it can only send at 60 MB/s, but if both producers run at that rate, the aggregate is 120 MB/s and each broker receives about 60 MB/s. So you are actually pushing more data through Kafka overall.

1
votes

There are a couple of things to consider:

  1. There are separate disk and network limits that apply to both the instance and the volume.
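  2. You have to account for replication. If you have RF=2, the amount of write traffic taken by a single node is 2*(PRODUCER_TRAFFIC)/(BROKER_COUNT), assuming even distribution of writes across partitions and of partition replicas across brokers. With two brokers and RF=2, every broker holds a replica of every partition, so each node writes the full producer traffic.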
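A minimal sketch of the replication arithmetic for the cluster in the question. Dividing by the number of brokers is an assumption about how the load splits per node (writes spread evenly across partitions, replicas spread evenly across brokers). It also assumes both producers run at about 60 MB/s in the two-producer test, which the question only states for one of them. The function name is hypothetical.

```python
def per_broker_write_mb_s(producer_mb_s, replication_factor, broker_count):
    """Disk write traffic per broker, assuming replicas are spread evenly."""
    return producer_mb_s * replication_factor / broker_count


# Single producer at 90 MB/s, RF=2, two brokers.
one_producer = per_broker_write_mb_s(90, 2, 2)

# Two producers at an assumed 60 MB/s each, RF=2, two brokers.
two_producers = per_broker_write_mb_s(2 * 60, 2, 2)

print(f"one producer:  {one_producer:.0f} MB/s written per broker")
print(f"two producers: {two_producers:.0f} MB/s written per broker")
```

Under these assumptions the two-producer case puts about 120 MB/s of writes on each broker, which is close to the EBS write rate reported in the question. That would be consistent with the volume, rather than the producer's network link, being the limit, though the sketch alone cannot confirm it.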