Kafka Partition and Throughput

Question

i have introductory experience with kafka and I am trying to explore its details.

I am trying to understand how kafka partitions can help improving throughput; in all information i found online; it is explained that more partition means more parallel streams; which make sense.

How ever with different point of view it does not.

lets say i have two consumers which consumes data at "10"messages per second from given topic. now no mater they are consuming from single partition or two different partitions; my throughput will remain same 20 messages per second.

i feel like i must be missing some details on inner workings can you help me by explaining how kafka partitions (more than one) can help improving throughput for fixed number of consumers Vs single kafka partition.

imagine you hit the maximum a thread can read and process from kafka during a second, which is 1000 msgs/sec(just an example). Now imagine you have producers making 2000 messages/second into the topic: you need a second thread to be able to consume at that level. Rule in Kafka: One partition-One thread(for the same consumer group), so you put a second partition. Without it, you wouldn't be able to read at 2000 msgs/sec — aran

JR ibkr JR ibkr · Accepted Answer · 2019-02-26T22:58:14

https://kafka.apache.org/intro

When I started to learn kafka; I had the same question. Following explanation will help you to answer your question:

Let's say you have a topic A with 3 partitions: X, Y & Z.

First thing to understand is how data is distributed across partitions:

Producer can choose in which partition a message will go. So your producer can send message#1 to partition-X, message#2 to partition-Y and message#3 to partition-Z. In the same way, other producers can choose in which partition data will be written. If your producer does not choose a partition then kafka will choose for you. For more information; please checkout producer API. Producer should never push message#1 to partition-X, partition-Y & partition-Z. You can create replicas to provide fault-tolerance. Partitions are not replicas.

Now, a consumer subscribes to your topic. Kafka will see how many consumers are active within a consumer group. It may allocate a partition to a consumer as following:

(in the image; P0, P1, P2 and P3 are partitions. Consumer group A has C1 & C2 consumers. C1 listens to P0, P3 and C2 listens to P1 and P2. In the end, your consumer group A will receive data from all partitions.)

If your consumer group had 3 consumers and you add one new consumer then it will sit ideal. No of consumers in consumer-group <= number of partitions.
If your consumer group had 2 consumers and you add a new one then rebalance will be triggered. Kafka will assign one partition to your consumer.
If this is brand new consumer-group then kafka will assign all partitions to this new consumer.

Now let's assume; your consumer is single-threaded and it takes about 1 second to process a message then your throughput would be 1 msg/second in case#3.

In case#2; it would be 3 msg/second. Because each consumer is listening to different partition and processing data.

In case#1; you won't get any benefit.

Kafka Partition and Throughput

2 Answers