
I am trying to come up with a design using Kafka for a number of processing agents to process messages from a Kafka topic in parallel.

I would like to get as close as possible to exactly-once processing per message across the whole consumer group, although I can tolerate at-least-once.

I find the documentation unclear in many regards, and I have a few specific questions to work out whether this is a viable approach:

  • if a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
  • is the "offset" per partition or per consumer/consumergroup/partition?
  • when I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
  • if I want to scale up new consumers and there are no free partitions (I believe there can be no more than one consumer per partition), will Kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?

Or are there any other points I am missing that may help my understanding of this?


1 Answer


if a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.

[A]: A message exists in exactly one partition; it is the partition itself that is replicated across brokers, depending on the topic's replication factor. If you have partition P1 in a cluster with 2 brokers and a replication factor of 2, then node1 will be the leader for P1 and node2 will also hold P1's contents/messages, but only as a replica (followers copy from the leader asynchronously; whether a produce is acknowledged before replication completes depends on the producer's acks setting).
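To make the distinction concrete, here is a toy sketch (not Kafka's real code; the hash and broker placement are simplified stand-ins for the real partitioner and replica assignment): a keyed record lands on exactly one partition, and that whole partition is then hosted on more than one broker.

```python
# Toy model: one message -> one partition; the partition -> several brokers.
NUM_PARTITIONS = 3
REPLICATION_FACTOR = 2
BROKERS = ["node1", "node2", "node3"]

def choose_partition(key: str) -> int:
    # Kafka's default partitioner hashes the record key (murmur2 in the
    # real client); plain hash() is a stand-in here.
    return hash(key) % NUM_PARTITIONS

def replicas_for(partition: int) -> list[str]:
    # Simplified placement: a leader plus (replication_factor - 1) followers.
    return [BROKERS[(partition + i) % len(BROKERS)]
            for i in range(REPLICATION_FACTOR)]

p = choose_partition("order-42")
partitions = {i: [] for i in range(NUM_PARTITIONS)}
partitions[p].append("order-42 payload")

# The message exists in exactly one partition...
assert sum(len(msgs) for msgs in partitions.values()) == 1
# ...but that partition lives on more than one broker.
leader, *followers = replicas_for(p)
print(f"partition {p}: leader={leader}, followers={followers}")
```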

is the "offset" per partition or per consumer/consumergroup/partition?

[A]: Per partition from a broker standpoint. It is also per consumer group: committed offsets are stored per (consumer group, topic, partition), and the 'offset' is explicitly tracked/managed on the consumer end. The consumer code can delegate this work to Kafka (automatic offset commits) or manage the offsets manually.
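A small sketch of that bookkeeping (a toy in-memory dict standing in for Kafka's internal __consumer_offsets topic): because the key includes the group, two groups reading the same partition track independent positions.

```python
# Toy model of committed offsets, keyed by (group, topic, partition).
committed: dict[tuple[str, str, int], int] = {}

def commit(group: str, topic: str, partition: int, offset: int) -> None:
    committed[(group, topic, partition)] = offset

def fetch_committed(group: str, topic: str, partition: int) -> int:
    # -1 stands in for "no offset committed yet for this group/partition"
    return committed.get((group, topic, partition), -1)

commit("billing", "orders", 0, 42)
commit("audit", "orders", 0, 7)

# Two groups on the same partition hold independent offsets:
assert fetch_committed("billing", "orders", 0) == 42
assert fetch_committed("audit", "orders", 0) == 7
assert fetch_committed("billing", "orders", 1) == -1  # never committed
```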

when I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?

[A]: Kafka triggers a rebalance when a new consumer joins the group and assigns certain partitions to it. From there on, the consumer only cares about the offsets of the partitions it is responsible for, resuming each one from the group's last committed offset.
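A toy sketch of that rebalance (a simplified round-robin stand-in for Kafka's real assignors): when a consumer joins, only partition ownership changes, and the new consumer picks up its partitions at the group's committed offsets.

```python
# Toy assignment: spread partitions round-robin across the group's members.
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    out: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        out[consumers[i % len(consumers)]].append(p)
    return out

# Offsets the group has already committed, per partition:
committed = {0: 100, 1: 250, 2: 75, 3: 90}

before = assign([0, 1, 2, 3], ["c1", "c2"])
after = assign([0, 1, 2, 3], ["c1", "c2", "c3"])  # c3 joins -> rebalance

# c3 only cares about the offsets of the partitions it now owns:
resume_from = {p: committed[p] for p in after["c3"]}
print(before)
print(after, resume_from)
```

Note that no messages move between partitions; each partition is simply handed, whole, to a (possibly different) consumer.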

if I want to scale up new consumers and there are no free partitions (I believe there can be not more than one consumer per partition), will kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?

[A]: For parallelism, the ideal scenario is a 1-1 mapping between consumers and partitions, e.g. if you have 10 partitions, you can have at most 10 active consumers. If you bring in an 11th one, Kafka won't assign any partitions to it unless an existing consumer leaves the group; it simply sits idle. Rebalancing never moves existing messages between partitions, only partition ownership.
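The cap on useful consumers falls out of the same toy assignment logic (again a simplified stand-in for Kafka's assignors): with 10 partitions and 11 consumers, one consumer ends up with nothing.

```python
# Toy round-robin assignment: one partition goes to exactly one consumer.
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    out: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        out[consumers[i % len(consumers)]].append(p)
    return out

consumers = [f"c{i}" for i in range(1, 12)]       # 11 consumers
assignment = assign(list(range(10)), consumers)   # but only 10 partitions

idle = [c for c, ps in assignment.items() if not ps]
print(f"idle consumers: {idle}")  # the 11th waits until someone leaves
```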