2
votes

I have a Kafka topic with 20 partitions and 5 consumers belonging to the same consumer group. This means that we have 4 partitions per consumer. Let's say:

  • consumer-0 is assigned partition-0, partition-1, partition-2 and partition-3
  • consumer-1 is assigned partition-4, partition-5, partition-6 and partition-7
  • consumer-2 is assigned partition-8, partition-9, partition-10 and partition-11
  • consumer-5 is assigned partition-12, partition-13, partition-14 and partition-15
  • consumer-4 is assigned partition-16, partition-17, partition-18 and partition-19

The producer sends 10 messages evenly to the topic. In this case, only partitions 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are going to receive messages. The remaining ones will be empty. Our problem is that consumer-0 and consumer-1 will each process 4 messages while, at the same time, consumer-2 will process two messages. Also, consumer-4 and consumer-5 will not do any processing since their partitions are idle.
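To make the arithmetic above concrete, here is a minimal simulation of the scenario. It is only a sketch: it assumes unkeyed records are spread round-robin starting at partition 0 (the real client may start at a different partition) and that each consumer owns a contiguous block of 4 partitions, as in the list above. Consumers are indexed by their position in that list.

```python
# Simulate the scenario from the question: 10 messages, 20 partitions, 5 consumers.
NUM_PARTITIONS = 20
NUM_CONSUMERS = 5
NUM_MESSAGES = 10

# Assumption: round-robin over partitions, starting at partition 0.
partition_counts = [0] * NUM_PARTITIONS
for message_index in range(NUM_MESSAGES):
    partition_counts[message_index % NUM_PARTITIONS] += 1

# Assumption: contiguous assignment, 4 partitions per consumer.
per_consumer = NUM_PARTITIONS // NUM_CONSUMERS
loads = [
    sum(partition_counts[c * per_consumer:(c + 1) * per_consumer])
    for c in range(NUM_CONSUMERS)
]
print(loads)  # [4, 4, 2, 0, 0]
```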

On the producer side, we are working with the DefaultPartitioner (kafka-client 2.3.1) so that the records are sent evenly to the partitions. We would like to ask whether it is possible to distribute messages fairly across Kafka consumers rather than across partitions. That way, each consumer would process only two messages and the processing load would be fairly distributed between consumers.
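To illustrate what "fair per consumer" would mean here, the sketch below picks partitions in an order that rotates over consumers first. It assumes the producer somehow knows the static assignment listed above, which Kafka does not provide to the producer out of the box. `consumer_aware_partition` is a hypothetical helper, not a Kafka API; in the Java client, logic like this would have to live in a custom `Partitioner` implementation.

```python
NUM_PARTITIONS = 20
NUM_CONSUMERS = 5
PER_CONSUMER = NUM_PARTITIONS // NUM_CONSUMERS  # 4 partitions per consumer

def consumer_aware_partition(message_index):
    """Hypothetical helper: choose the partition for the n-th message so that
    consecutive messages land on different consumers."""
    consumer = message_index % NUM_CONSUMERS                 # rotate over consumers first
    slot = (message_index // NUM_CONSUMERS) % PER_CONSUMER   # then over that consumer's partitions
    return consumer * PER_CONSUMER + slot

chosen = [consumer_aware_partition(i) for i in range(10)]

# Count messages per consumer, assuming contiguous blocks of 4 partitions each.
loads = [0] * NUM_CONSUMERS
for partition in chosen:
    loads[partition // PER_CONSUMER] += 1

print(chosen)  # [0, 4, 8, 12, 16, 1, 5, 9, 13, 17]
print(loads)   # [2, 2, 2, 2, 2]
```

Note that this only stays balanced while the assignment it was built for remains unchanged.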


2 Answers

1
votes

I think the calculations you made are not relevant, because there is no realistic scenario in which only 10 messages are sent. If that really is your situation, you should consider using fewer partitions and correspondingly fewer consumers in the consumer group.
You can assume that for a larger number of records in the stream, your producer will distribute the load roughly evenly between partitions and therefore between consumers. At that point you don't care whether consumer-1 received 1000 records and consumer-2 received 998.
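A quick way to see this is to measure how the imbalance shrinks as the volume grows. The sketch below assumes round-robin distribution over 20 partitions and contiguous blocks of 4 partitions per consumer; `consumer_loads` and `relative_spread` are illustrative names, not Kafka APIs.

```python
def consumer_loads(num_messages, num_partitions=20, num_consumers=5):
    """Round-robin messages over partitions, then sum per consumer
    (assumes each consumer owns a contiguous block of partitions)."""
    per_consumer = num_partitions // num_consumers
    loads = [0] * num_consumers
    for message_index in range(num_messages):
        partition = message_index % num_partitions
        loads[partition // per_consumer] += 1
    return loads

def relative_spread(loads):
    """(max - min) / mean; 0.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return (max(loads) - min(loads)) / mean

for n in (10, 1_010, 100_010):
    loads = consumer_loads(n)
    print(n, loads, relative_spread(loads))
```

With 10 messages the loads are `[4, 4, 2, 0, 0]`; with 1,010 they are `[204, 204, 202, 200, 200]`. The absolute gap stays at 4 messages, so it becomes negligible relative to the total.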

Remember also that if the load varies and, during quiet phases, you would rather have every consumer handle the same share than sit idle, it is still completely OK that some consumers get 4 messages, others 2, and others 0. Processing 4 messages is basically being "idle" relative to the loads you are expecting, and these differences are too minor to matter. So let Kafka do the magic for the higher loads, when processing power and time really matter.

0
votes

In general, I do not think it is good design to force a producer to partition the data based on the consumer. A Kafka topic should separate the producer from the consumer and decouple them from each other.

Two main reasons to not try to achieve this:

  • a Kafka topic is meant to be consumed by multiple consumer groups, and they are (hopefully) all independent of each other in terms of consumer threads.
  • a consumer group and its consumers are not stable: one of them could die and trigger a rebalance. You would then need a sticky partition assignment strategy, which adds more complexity to your consumer. And what if one of the 5 consumers dies for good? You would not be able to read the messages of its four partitions. Remember that a consumer group is a "moving thing", and I recommend letting Kafka handle it as much as possible.

I understand this might not actually answer your question. If you want proper balancing, you should match the number of partitions to the number of consumer threads and ensure on the producer side that all messages are produced in a balanced way across the partitions.

Remember that even when using the DefaultPartitioner with as many as 20 partitions, you can still end up producing unbalanced data, as it depends on the hash value of your keys.
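The effect of keys can be sketched with a small simulation. This is only an illustration: `zlib.crc32` stands in for the hash the Kafka client actually uses (murmur2), and the workload of 10,000 records over 5 distinct keys is a made-up example. The point is that records with the same key always map to the same partition, so a handful of keys can never fill 20 partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 20

def partition_for(key):
    """Stand-in for keyed partitioning: hash(key bytes) mod partition count.
    zlib.crc32 is used only for illustration; the Kafka client uses murmur2."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Hypothetical workload: 10,000 records spread over only 5 distinct keys.
keys = [f"customer-{i % 5}" for i in range(10_000)]
counts = Counter(partition_for(k) for k in keys)

print(len(counts), "of", NUM_PARTITIONS, "partitions received data")
print(dict(counts))
```

At most 5 partitions receive anything here, so at least 15 stay empty no matter how many records are produced.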