I have a design with multiple producers P1, P2, P3, P4 ... PN writing to a single topic T1 that has 32 partitions.
On the other side I have up to 32 consumers in a single consumer group.
I would like to load balance my message consumption.
Reading the docs I could see 3 options:
1. Define the partition myself (drawback: I would have to know where the last message was sent, or define a partition range for each producer P)
2. Define a key and leave the partition decision to Kafka's hash algorithm (drawback: load balancing would depend on luck)
(As per Chris's answer, load balancing should be left to the hash algorithm.) In reality this does not give equal distribution across consumers, since consumers are bound to partitions, and I would have to understand the hash algorithm to choose a good key - which to me sounds the same as picking the partition myself (and that choice would have to be coordinated across the producers)
My current code uses a UUID as the key. Analysis of the partitions chosen, and consequently of the consumers doing the work, shows a distribution that can be far from equal. I'm reproducing it below:
The image above shows the number of messages received by each partition in a 5-minute window, using a UUID as my key - at that point in time I had 8 consumers.
The consumption takes about 2 minutes per message. The cells in red show a queue of 9 requests on one of the consumers, while other consumers had low loads - or zero load, like the consumer in green.
If a random key is not a good option, what should I choose?
3. No partition and no key: leave it to Kafka's round-robin algorithm (drawback: the round robin is internal to each producer, meaning all producers could be sending their messages to the same partition). I also tested this option and the result is below:
The image above shows that the round robin is, apparently, internal to each producer.
Do I really need to write the overall load balancing algorithm myself? Am I missing something?
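For what it's worth, the workaround I'm currently considering is option 1 with the coordination baked in: hand each producer a distinct index at startup and have it round-robin over all 32 partitions from its own starting offset, so the producers' counters don't all start at partition 0 in lockstep. A sketch (the class and its names are mine, not a Kafka API):

```python
# Sketch of explicit partition assignment (option 1), assuming each
# producer is handed a distinct index 0..N-1 at startup. Every producer
# still cycles over all partitions, but starts at a different offset,
# unlike the independent internal round-robin counters that all begin
# at partition 0.
from itertools import count

NUM_PARTITIONS = 32

class OffsetRoundRobinPartitioner:
    """Per-producer partition chooser; producer_index staggers the start."""

    def __init__(self, producer_index: int, num_partitions: int = NUM_PARTITIONS):
        self._counter = count()
        self._offset = producer_index % num_partitions
        self._num_partitions = num_partitions

    def next_partition(self) -> int:
        return (self._offset + next(self._counter)) % self._num_partitions

if __name__ == "__main__":
    # Two producers: their first messages land on different partitions.
    p0 = OffsetRoundRobinPartitioner(producer_index=0)
    p1 = OffsetRoundRobinPartitioner(producer_index=1)
    print([p0.next_partition() for _ in range(3)])  # [0, 1, 2]
    print([p1.next_partition() for _ in range(3)])  # [1, 2, 3]
```

The chosen value would then be passed as the explicit partition on each send. This only balances message counts, not per-message processing time, so one slow message can still back a consumer up - which is why I'm asking whether I'm missing a built-in way to do this.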