23
votes

I'm reading the Kafka documentation and noticed the following line:

Note however that there cannot be more consumer instances in a consumer group than partitions.

Hmm. How can I auto-scale this?

For example let's say I have a messaging system with hi/lo priorities, so I create a topic for messages and partitions for hi and lo priority messages.

If this was RabbitMQ, I'd have an auto-scalable group of consumers assigned to each partition, like this:

enter image description here

If I understand the Kafka model I can't have >1 consumer per partition in a consumer group, so that picture doesn't work for Kafka, right?

Ok, so what about >1 consumer groups like this:

enter image description here

That get's around Kafka's limitation but... If I understand how this works both consumer groups would be pulling from a partition, for example msg.hi, with their own offsets so neither would know about the other--meaning messages would likely be delivered twice!

How can I achieve the capability I had in the Rabbit design w/Kafka and still maintain the "queue-ness" of the behavior (I don't want to send a message twice)? What am I missing?

3

3 Answers

7
votes

Just create a bunch of partitions for hi and lo. 12 is a good number. So is 60. Just pick a number of partitions that matches how much maximum parallelization you want.

Honestly, although I personally would makemsg.hi and msg.lo be different topics entirely, that's not a requirement -- you can do custom parititoning to divide messages between partitions.

12
votes

Topic is made up of partitions. Partitions decide the max number of consumers you can have in a group. enter image description here

In the above picture, we have only one consumer. It can read all the messages from all the partitions. When you increase the number of consumers in the group, partition reassignment happens and instead of consumer 1 reading all the messages from all the partitions, consumer 2 could share some of the load with consumer 1 as shown below. enter image description here What happens If I have more number of consumers than the number of partitions.? Each consumer would be assigned 1 partition. Any additional consumers in the group will be sitting idle unless you increase the number of partitions for a Topic.

enter image description here

So, we need to choose the partitions accordingly. That decides the max number of consumers in the group. Changing the partition for an existing topic is really not recommended as It could cause issues. That is, Lets assume a producer producing names into a topic where we have 3 partitions. All the names starting with A-I go to Partition 1, J-R in partition 2 and S-Z in partition 3. Lets also assume that we have already produced 1 million messages. Now if you suddenly increase the number of partitions to 5 from 3, It will create a different A-Z range now. That is, A-F in Partition 1, G-K in partition 2, L-Q in partition 3, R-U in partition 4 and V-Z in partition 5. Do you get it? It kind of affects the order of the messages we had before! So you need to be aware of this. If this could be a problem, then we need to choose the partition accordingly upfront.


More info is here - http://www.vinsguru.com/kafka-scaling-consumers-out-for-a-consumer-group/

9
votes

Your assumption about messages being consumed twice is correct (since each group consumes 100% of messages from a topic).
I agree with David. Moreover, I suggest that you create more partitions than you really need, which would leave you some headroom to increase the number of threads in the group when such a need arises.

You can always increase the number of partitions later (and/or add additional brokers), but it's nice to have that already done, so that you can only increase number of threads and be done with it (those situations usually require a quick response, so you should do all the prep. that you can do in advance).