0 votes

I am new to Kafka and think I am missing something about how partitions get balanced across consumers on a topic.

We have 5 partitions and 2 consumers on a topic. The topic's records have a null key, so I assume Kafka assigns each new record to a partition in a round-robin fashion.
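Roughly, the producer side looks like this sketch (the topic name and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class NullKeyProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key is null, so the producer's partitioner picks the partition itself
                producer.send(new ProducerRecord<>("my-topic", null, "some value"));
            }
        }
    }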

This would mean one consumer would be reading from 3 partitions and the other from 2. If my assumption is right (that the records get evenly distributed across partitions), the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.

I think you should have a number of partitions that is evenly divisible by the number of consumers.

Am I missing something?


4 Answers

2 votes

The unit of parallelism in consuming Kafka messages is the partition. The routine scenario for consuming Kafka messages is a distributed stream processing engine such as Apache Flink, Spark, or Storm, all of which distribute the processing across CPU cores. The rule is that the maximum level of parallelism for a consumer group is the number of partitions. Each consumer instance of a consumer group (say, a CPU core) can consume one or more partitions, while each partition can be consumed by only one consumer instance per consumer group.

  • If you have more CPU cores than partitions, some of them will be idle.
  • If you have fewer CPU cores than partitions, some of them will consume more than one partition.
  • The optimal case is when the number of CPU cores equals the number of partitions.

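A minimal sketch of one consumer instance, assuming the plain Java client with placeholder topic, broker, and group names; start several copies with the same group.id and Kafka will split the partitions among them:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("group.id", "my-group");                // same group.id => partitions are shared
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        // Each instance only ever sees records from the partitions assigned to it
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                r.partition(), r.offset(), r.value());
                    }
                }
            }
        }
    }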

0 votes

If my assumption is right (that the records get evenly distributed across partitions), the consumer with 3 partitions would be doing more work (1.5x more). This could lead to one consumer doing nothing while the other keeps working hard.

Why would one consumer do nothing? It would still process records from its 2 partitions [assuming, of course, that both consumers are in the same group].

I think you should have a number of partitions that is evenly divisible by the number of consumers.

Yes, that's right. For maximum parallelism, you can have as many consumers as there are partitions; in your case, 5 consumers would give you maximum parallelism.
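If you are unsure how many partitions the topic has, the AdminClient in kafka-clients can tell you; a minimal sketch with placeholder names:

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.AdminClient;

    public class PartitionCount {
        public static void main(String[] args) throws ExecutionException, InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            try (AdminClient admin = AdminClient.create(props)) {
                // Look up the topic's partition count; running more consumers than this is wasted
                int partitions = admin.describeTopics(Collections.singletonList("my-topic"))
                        .values().get("my-topic").get().partitions().size();
                System.out.println("Run up to " + partitions + " consumers in the group for max parallelism");
            }
        }
    }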

0 votes

Your understanding is correct. Maybe there is data skew. You can check how many records are in each partition with an offset checker or a similar tool; a sketch of one way to do it follows.
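A minimal sketch of that check using the Java client, approximating each partition's record count as end offset minus beginning offset (topic and broker names are placeholders; counts are approximate if retention or compaction has removed records):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;

    public class PartitionSkewCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> tps = new ArrayList<>();
                for (PartitionInfo p : consumer.partitionsFor("my-topic")) {
                    tps.add(new TopicPartition(p.topic(), p.partition()));
                }
                Map<TopicPartition, Long> begin = consumer.beginningOffsets(tps);
                Map<TopicPartition, Long> end = consumer.endOffsets(tps);
                for (TopicPartition tp : tps) {
                    // Rough record count per partition; large differences indicate skew
                    System.out.printf("%s: ~%d records%n", tp, end.get(tp) - begin.get(tp));
                }
            }
        }
    }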

0 votes

There is an assumption built into your understanding that each partition has exactly the same throughput. For most applications, that is rarely exactly true. If you set up your keying/partitioning right, the partitions should be close to equal, especially with a large and diverse keyspace averaged over a long period of time. In practice, though, you will probably have some skew at any given moment, and your stream processing setup needs to tolerate that. So having one more partition assigned to a particular consumer is probably not going to make a big difference.
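For reference, with non-null keys the Java client's default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count, so partition balance depends directly on your keyspace. A small sketch of that mapping (the keys are hypothetical):

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class KeyToPartition {
        public static void main(String[] args) {
            int numPartitions = 5;
            String[] keys = {"user-1", "user-2", "user-3"}; // hypothetical keys
            for (String key : keys) {
                byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
                // Same hash-mod-N scheme the default partitioner uses for non-null keys
                int partition = Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
                System.out.println(key + " -> partition " + partition);
            }
        }
    }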