I have configured a two-node Kafka cluster on AWS and am currently testing its performance characteristics.

I used kafka-consumer-perf-test.sh to read 50 million messages from a Kafka topic using a single thread.

I observed the following while testing consumer throughput.

Observation 1

Single consumer on an m4.large EC2 instance - read throughput 40.2 MB/sec.

Three consumers on three separate m4.large EC2 instances - individual read throughput 40.25 MB/sec.

No disk reads or writes were reported on the Kafka broker side (running on two separate m4.2xlarge EC2 instances backed by 2 EBS volumes).

I ran the tool again after some time.

Observation 2

Three consumers on three separate m4.large EC2 instances - individual read throughput dropped to 34.25 MB/sec.

In this case I observed a considerable rate of disk reads on the Kafka broker nodes.

I would appreciate it if you could help me clarify the following.

  1. In observation 1, since I did not observe any disk reads, could all data have been fetched from memory (where it is cached)?

  2. In observation 2, I assume performance dropped due to the disk reads. Although the brokers read from disk, enough free memory was still available according to the nmon reports. What could have caused the consumers to read from disk instead of memory? How long does data written by producers stay in the cache?

  3. I assume the maximum read throughput of 40 MB/s is due to the network bandwidth limit of the m4.large EC2 instance. Is this assumption correct?
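For reference, here is a minimal sketch of the unit conversion behind that last assumption, so the figures can be compared against a measured link speed (for example from iperf). The function name is illustrative, and the conversion assumes the tool's "MB" means 10^6 bytes; if it actually reports MiB, the real bit rate is about 5% higher.

```python
def mb_per_sec_to_mbit_per_sec(mb_per_sec: float) -> float:
    """Convert megabytes per second to megabits per second (1 byte = 8 bits)."""
    return mb_per_sec * 8


# Figures taken from Observation 1 above.
single_consumer = mb_per_sec_to_mbit_per_sec(40.2)
# Three consumers each reading 40.25 MB/sec from the two brokers combined.
aggregate = mb_per_sec_to_mbit_per_sec(3 * 40.25)

print(f"single consumer: {single_consumer:.1f} Mbit/s")
print(f"aggregate served by brokers: {aggregate:.1f} Mbit/s")
```

Since per-consumer throughput stayed the same when going from one consumer to three, the limit in Observation 1 appears to sit on the consumer side rather than on the brokers.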

Thank you.

1 Answer

It is recommended to set the Linux kernel parameter vm.swappiness = 1 to make the best use of the page cache for reads and avoid disk I/O.

See https://en.m.wikipedia.org/wiki/Swappiness
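A minimal sketch for checking the current value on a broker, assuming a Linux host that exposes the setting at /proc/sys/vm/swappiness. The helper name `read_swappiness` is made up for illustration.

```python
from pathlib import Path

# Standard procfs location of the setting on Linux.
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path: str = SWAPPINESS_PATH) -> int:
    """Return the current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


# Only attempt the check where the procfs entry exists (i.e. on Linux).
if __name__ == "__main__" and Path(SWAPPINESS_PATH).exists():
    value = read_swappiness()
    print(f"vm.swappiness = {value}")
    if value > 1:
        # Changing it requires root, e.g. `sysctl -w vm.swappiness=1`.
        print("Value is higher than the recommended 1.")
```

To make the change survive a reboot, the setting is usually also added to /etc/sysctl.conf.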

It is also recommended to run Kafka standalone on its own VM or physical server, so that all the available RAM is used for the page cache.
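As a rough way to reason about how long produced data stays cached, here is a back-of-the-envelope sketch. It assumes a simplified model in which new writes push the oldest pages out of the cache, so retention is roughly cache size divided by write rate; the real kernel behaviour is more complex. The function name and both input figures are hypothetical, not measurements from your cluster.

```python
def cache_retention_seconds(page_cache_bytes: float,
                            write_rate_bytes_per_sec: float) -> float:
    """Estimate how long freshly written data stays in the page cache.

    Simplified model: the cache holds the most recent `page_cache_bytes`
    of writes, so older data is evicted after size / rate seconds.
    """
    if write_rate_bytes_per_sec <= 0:
        raise ValueError("write rate must be positive")
    return page_cache_bytes / write_rate_bytes_per_sec


GIB = 1024 ** 3

# Hypothetical: 28 GiB of a 32 GiB m4.2xlarge left over for page cache.
page_cache = 28 * GIB
# Hypothetical: producers writing 50 MB/s to this broker (replicas included).
write_rate = 50 * 10 ** 6

minutes = cache_retention_seconds(page_cache, write_rate) / 60
print(f"data stays cached for roughly {minutes:.1f} minutes")
```

Under this model, a consumer that reads data older than the estimated window would hit disk, which would be consistent with the disk reads seen when the test was rerun after some time.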

Confluent just published AWS-based benchmark results for Apache Kafka 0.11 (Confluent Platform 3.3), which include tests with and without page cache hits, if you want a comparison.

The benchmark is linked from this blog post:

https://www.confluent.io/blog/we-will-say-exactly-confluent-platform-3-3-available-now/

The benchmark results are here:

https://docs.google.com/spreadsheets/u/1/d/1dHY6M7qCiX-NFvsgvaE0YoVdNq26uA8608XIh_DUpI4/htmlview