2
votes

I'm reading Kafka documentation about consumers and faced the following message consumption definition:

Our topic is divided into a set of totally ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time. This means that the position of a consumer in each partition is just a single integer, the offset of the next message to consume.

I interpreted the wording as follows:

A consumer group reads data from a topic consisting of a number of partitions. Then each consumer from the group is assigned with some subset of partitions that do not overlap with other consumer's partitions from the group.


Consider the following case:

A consumer group GRP consisting of 2 consumers C1 and C2 reads data from a topic TPC consisting of 2 partitions P1 and P2.

QUESTION: If at some point C1 reads from P1 and C2 reads from P2 can it be rebalanced so that C1 starts reading from P2 and C2 from P1. If so under which condition may that happen?

It does not contradict to the quote above.

1
IIUC: 1. if both consumers lose connection simultaneously, they will be reassigned and might end up that way 2. programmatically maybe, if you have some weird partitioner that does that.daniu
@daniu Then I can imagine the following case: C1 read some message from P1 without commiting offset. Then it loses the connection to Kafka and processes the message succesfully. At the same time the other consumer C3 is created and assigned to the P1 reading the same message. That might be a problem...Some Name

1 Answers

1
votes

I see a few things to be discussed in your question and comment.

  1. Your interpretation of the quoted paragraph is correct.

  2. Question "If so under which condition may that happen?": Yes, this scenario can happen. A change in the assignment of a consumer to a TopicPartition is mainly triggered through a rebalancing. A consumer rebalance will be triggered in the following cases:

Consumer rebalances are initiated when

  • A Consumer leaves the Consumer group (either by failing to send a timely heartbeat or by explicitly requesting to leave)

  • A new Consumer joins the Consumer Group

  • A Consumer changes its Topic subscription

  • The Consumer Group notices a change to the Topic metadata for any subscribed Topic (e.g. an increase in the number of Partitions)

[Source: Training Material of Confluent Kafka Developer]

Keep in mind, that during a Rebalance all consumers are paused.

  1. Your comment "C1 read some message from P1 without commiting offset. Then it loses the connection to Kafka and processes the message succesfully. At the same time the other consumer C3 is created and assigned to the P1 reading the same message."

I see this scenario unrelated to a consumer rebalance, as your consumer C1 could just die after processing the data but before committing the back to Kafka. Now, if you restart the consumer C1 it will read the same messages again because it did not yet commit them.

This is called "at-least-once" delivery semantics and is different to "at-most-once" semantics when you have e.g. auto.commit enabled. I guess you are looking for the "holy grail" in distributed systems which is "exactly-once-semantics" :)

For this to achieve you need to consider the entire application from Kafka to the sink of your application. If the output of your application is not idempotent you are likely not able to achieve exactly-once semantics (EOS). But if your output sink e.g. is Kafka again you actually can achieve EOS.