1
votes

Is it possible to configure/code a Kafka consumer application to unilaterally implement "Exactly Once Semantics" to handle failure recovery (i.e., resume where left off after a comm failure, etc) independent of producer code (calling KafkaProducer methods, etc)?

After some googling, it appears all the "Exactly Once Semantics" (EOS) demos I've found (at least so far) involve calling methods on both producer and consumer instances within the same application to accomplish this.

Here's an example: https://www.baeldung.com/kafka-exactly-once

Can an independent consumer/client application be configured for EOS failure recovery/resume - independent of producer code (i.e., calling KafkaProducer methods, etc)?

If so, can you point me to an example?

2

2 Answers

4
votes

No, an independent consumer can not be configured to consume messages from Kafka exactly-once.

You can either have it as "at-most-once" or "at-least-once". Making it exactly-once highly depends on what the consumer is doing with the data and how and when you commit the messages back to Kafka.

You would have to implement this on your own. As an example you could have a look at the implementation of Spark Structured Streaming (also: spark-sql-kafka library) which makes use of write-ahead-logs in order to ensure exactly-once semantics.

1
votes

Although the other answer is correct, I would state briefly this in a slightly different fashion:

  • the target / sink needs to be idempotent (KV store or UPSert to something like KUDU)
  • and the source replayable.

Quoting from this blog explains it well imho, https://www.waitingforcode.com/apache-spark-structured-streaming/fault-tolerance-apache-spark-structured-streaming/read:

"...

Indeed, neither the replayable source nor commit log don't guarantee exactly-once processing itself. What if the batch commit fails ? As told previously, the engine will detect the last committed offsets as offsets to reprocess and output once again the processed data to the sink. It'll obviously lead to a duplicated output. But it'd be the case only when the writes and the sink aren't idempotent.

An idempotent write is the one that generates the same written data for given input. The idempotent sink is the one that writes given generated row only once, even if it's sent multiple times. A good example of such sink are key-value data stores. Now, if the writer is idempotent, obviously it generates the same keys every time and since the row identification is key-based, the whole process is idempotent. Together with replayable source it guarantees exactly-once end-2-end processing.

..."

As an English native speaker not 100% sure the don't is correct, but I think we can get the drift.