1
votes

while i understand the pre-requisite of having co-partitioning as explained here Why does co-partitioning of two Kstreams in kafka require same number of partitions for both the streams? , I do not understand the mechanism that make sure that the partitions of each topic that have the same key, get assigned to the same KAFKA Stream. I do not see how the consumer group of KAFKA would enable that.

The way i understand it is that, we have 2 independent consumer groups, which actually may have the same name, because it is the same kafka stream application, although the suscription to each topic is independent from each other.

Somehow, the consumers in each consumer group, get assigned to partition that contains the same key. I did not know that consumer assignment to partition could be related to the content of the partitions. So far i though it was random.

Can someone explain that part ?

2

2 Answers

2
votes

The way i understand it is that, we have 2 independent consumer groups, which actually may have the same name, because it is the same kafka stream application, although the suscription to each topic is independent from each other.

All members of a consumer group have the same "name" (ie, group.id) -- it is not possible to have two consumer groups with the same name. It would be one consumer group.

although the suscription to each topic is independent from each other

For KafkaConsumer it's possible to have different subscription for different members in the group (even if this should be a very rare scenario). For Kafka Streams however, it is required that all members of the group (ie, application instances) execute the exact some Topology with the exact some input topics (ie, their subscription must be the same).

I did not know that consumer assignment to partition could be related to the content of the partitions. So far i though it was random.

That is correct.

From your own answer:

In other words, if the number of partitions is the same, and the partition strategy of each producer of the topic is the same, message with same key will be assigned in the same way on the partition range, which is assigned to the consumer in the same way, i.e. as consecutive subset of partitions from each topic. Hence The same stream task will always have partitions of both topics which have the same key.

That is also correct.

Note, that Kafka Streams uses a special partition assignor (not the default ones the consumer offers) to ensure co-partitioning, stickiness (ie, state-store awareness), and to assign standby-tasks.

1
votes

After refreshing I found the two following statement that explains it all:

A consumer group has a unique id. Each consumer group is a subscriber to one or more Kafka topics.

Hence a consumer group may involve multiple topics with their partition and a strategy to assign them to the consumer of the group.

PARTITION.ASSIGNMENT.STRATEGY (In Kafka Definitive Guide)

A PartitionAssignor is a class that, given consumers and topics they subscribed to, decides which partitions will be assigned to which consumer. By default, Kafka has two assignment strategies:

  • Range: Assigns to each consumer a consecutive subset of partitions from each topic it subscribes to. So if consumers C1 and C2 are subscribed to two topics, T1 and T2, and each of the topics has three partitions, then C1 will be assigned partitions 0 and 1 from topics T1 and T2, while C2 will be assigned partition 2 from those topics. Because each topic has an uneven number of partitions and the assignment is done for each topic independently, the first consumer ends up with more partitions than the second. This happens whenever Range assignment is used and the number of consumers does not divide the number of partitions in each topic neatly.

In other words, if the number of partitions is the same, and the partition strategy of each producer of the topic is the same, message with same key will be assigned in the same way on the partition range, which is assigned to the consumer in the same way, i.e. as consecutive subset of partitions from each topic. Hence The same stream task will always have partitions of both topics which have the same key.