
I see in some answers around Stack Overflow, and on the web in general, the idea that Kafka does not support consumption acknowledgements, or that exactly-once consumption is hard to achieve.

For example, in the entry Is there any reason to use RabbitMQ over Kafka?, I can read the following statements:

RabbitMQ will keep all states about consumed/acknowledged/unacknowledged messages while Kafka doesn't

or

Exactly once guarantees are hard to get with Kafka.

This is not what I understand from reading the official Kafka documentation at: https://kafka.apache.org/documentation/#design_consumerposition

That documentation states that Kafka does not use a traditional acknowledgement implementation (as RabbitMQ does). Instead, it relies on the relationship between partition, consumer, and offset...

This makes the equivalent of message acknowledgements very cheap

Could somebody please explain why an "only once consumption guarantee" is difficult to achieve in Kafka? How does this differ between Kafka and other, more traditional message brokers such as RabbitMQ? What am I missing?

What you mean by "only once consumption guarantees" is usually called "exactly-once". – Matthias J. Sax
1. AFAIK, consuming exactly once is the consumer's problem. Apache Kafka is a broker supporting the pub-sub paradigm, so it traditionally does not keep track of offsets for the consumer. It is up to the consumer to manage how it consumes. E.g., for stream processing, vanilla Apache Storm guarantees at least once, while Apache Trident is its exactly-once counterpart... contd. – Albatross
2. I mentioned Storm as I have interfaced Kafka with Storm. There can be ways to achieve exactly once using Kafka, but I am not sure it is Kafka's (or, for that matter, any broker's) responsibility to achieve that on the consumer's behalf. – Albatross

1 Answer


If you mean exactly once, the problem is this. Kafka consumers, as you may know, use a polling mechanism; that is, consumers ask the server for messages. You also need to recall that the consumer commits message offsets; that is, it tells the cluster what the next expected offset is. So, imagine what could happen.
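To make the poll/commit mechanics concrete, here is a minimal sketch using the official Java client. The topic name, group id, and process method are placeholders of my own, not anything mandated by Kafka:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollLoopSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");      // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder topic
            while (true) {
                // Polling: the consumer asks the server for messages.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                // Commit: tell the cluster the next expected offset per partition.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + " -> " + record.value()); // stand-in for real work
    }
}
```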

The consumer polls for messages and gets the message with offset = 1.

A) If the consumer commits that offset immediately, before processing the message, then it can crash and will never receive that message again, because the offset was already committed; on the next poll Kafka will return the message with offset = 2. This is what they call at-most-once semantics.

B) If the consumer processes the message first and then commits the offset, what could happen is that the consumer crashes after processing the message but before committing; in that case the next poll will get the same message with offset = 1 again, and it will be processed twice. This is what they call at least once. Both orderings are sketched in code below.
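In code, the only difference between the two semantics is where the commit sits relative to the processing. A sketch, reusing the consumer and process placeholders from the loop above:

```java
// A) At-most-once: commit before processing.
ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
consumer.commitSync();                  // offset committed first...
for (ConsumerRecord<String, String> r : batch) {
    process(r);                         // ...a crash here loses the message
}

// B) At-least-once: process before committing.
ConsumerRecords<String, String> nextBatch = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> r : nextBatch) {
    process(r);                         // a crash after this...
}
consumer.commitSync();                  // ...but before this reprocesses the batch
```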

In order to achieve exactly once, you need to process the message and commit its offset in a single atomic operation, where you always do both or neither. This is not so easy. One way to do this (if possible) is to store the result of the processing along with the offset of the message that generated that result, in the same transaction. Then, when the consumer starts, it looks up the last processed offset outside Kafka and seeks to the offset just after it.
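A sketch of that approach, assuming the result and the offset are written to some external store in one transaction. The helpers readLastProcessedOffset and storeResultAndOffsetAtomically are hypothetical, standing in for your own transactional storage code:

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceSketch {
    public static void run(KafkaConsumer<String, String> consumer) {
        // Assign the partition explicitly so we control the starting position.
        TopicPartition tp = new TopicPartition("demo-topic", 0);  // placeholder topic
        consumer.assign(Collections.singletonList(tp));

        // Resume from the offset stored alongside the last result,
        // not from Kafka's committed offset.
        long lastProcessed = readLastProcessedOffset(tp);         // hypothetical helper
        consumer.seek(tp, lastProcessed + 1);

        while (true) {
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                // One transaction writes both the result and r.offset():
                // either both land or neither does, so a crash in between can
                // neither lose the message nor produce a duplicate result.
                storeResultAndOffsetAtomically(process(r), tp, r.offset());
            }
            // Note: no consumer.commitSync() here; Kafka's offset storage is unused.
        }
    }

    // Hypothetical helpers, standing in for your transactional store:
    static long readLastProcessedOffset(TopicPartition tp) { return -1L; }
    static String process(ConsumerRecord<String, String> r) { return r.value(); }
    static void storeResultAndOffsetAtomically(String result, TopicPartition tp, long offset) { }
}
```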