2
votes

i have introductory experience with kafka and I am trying to explore its details.

I am trying to understand how kafka partitions can help improving throughput; in all information i found online; it is explained that more partition means more parallel streams; which make sense.

How ever with different point of view it does not.

lets say i have two consumers which consumes data at "10"messages per second from given topic. now no mater they are consuming from single partition or two different partitions; my throughput will remain same 20 messages per second.

i feel like i must be missing some details on inner workings can you help me by explaining how kafka partitions (more than one) can help improving throughput for fixed number of consumers Vs single kafka partition.

2
imagine you hit the maximum a thread can read and process from kafka during a second, which is 1000 msgs/sec(just an example). Now imagine you have producers making 2000 messages/second into the topic: you need a second thread to be able to consume at that level. Rule in Kafka: One partition-One thread(for the same consumer group), so you put a second partition. Without it, you wouldn't be able to read at 2000 msgs/secaran

2 Answers

5
votes

https://kafka.apache.org/intro

When I started to learn kafka; I had the same question. Following explanation will help you to answer your question:

Let's say you have a topic A with 3 partitions: X, Y & Z.

First thing to understand is how data is distributed across partitions:

Producer can choose in which partition a message will go. So your producer can send message#1 to partition-X, message#2 to partition-Y and message#3 to partition-Z. In the same way, other producers can choose in which partition data will be written. If your producer does not choose a partition then kafka will choose for you. For more information; please checkout producer API. Producer should never push message#1 to partition-X, partition-Y & partition-Z. You can create replicas to provide fault-tolerance. Partitions are not replicas.

Now, a consumer subscribes to your topic. Kafka will see how many consumers are active within a consumer group. It may allocate a partition to a consumer as following:

Kafka partition distribution

(in the image; P0, P1, P2 and P3 are partitions. Consumer group A has C1 & C2 consumers. C1 listens to P0, P3 and C2 listens to P1 and P2. In the end, your consumer group A will receive data from all partitions.)

  1. If your consumer group had 3 consumers and you add one new consumer then it will sit ideal. No of consumers in consumer-group <= number of partitions.
  2. If your consumer group had 2 consumers and you add a new one then rebalance will be triggered. Kafka will assign one partition to your consumer.
  3. If this is brand new consumer-group then kafka will assign all partitions to this new consumer.

Now let's assume; your consumer is single-threaded and it takes about 1 second to process a message then your throughput would be 1 msg/second in case#3.

In case#2; it would be 3 msg/second. Because each consumer is listening to different partition and processing data.

In case#1; you won't get any benefit.

3
votes

I think your first misunderstanding is in

10 messages per second from given topic.

In Kafka, a topic is not really a concrete thing. You should instead see it as collection of partitions that have the same name and configuration.

Then in

lets say i have two consumers which consumes data at "10"messages per second from given topic. now no mater they are consuming from single partition or two different partitions; my throughput will remain same 20 messages per second.

This is not entirely correct, especially when considering Consumer Groups which is a key feature of Kafka.

If you have a single partition, you can't have multiple consumers in the same group consuming at the same time. If your consumer are in different groups, each consumer will all receive all messages. By having multiple partitions, you're able to have multiple consumers running at the same time.

For example with 2 partitions, you can have 2 consumers running in the same group, consumer 1 receives records from partition 0 and consumer 2 from partition 1. If you only had a single partition, only 1 consumer could consume (per group).

In addition, partitions can be on different brokers which again helps for scalability.