24
votes

I'm studying distributed systems and referring to this old question: stackoverflow link

I really can't understand the difference between exactly-once, at-least-once and at-most-once guarantees, I read these concepts in Kafka, Flink and Storm and Cassandra also. For instance someone says that Flink is better because has exactly-once guarantees while Storm has only at-least-once.

I understand that exactly-once mode is better for latency but at the same time it's worse for fault tolerance right? How can recover a stream if I haven't duplicates? and then... if this is a real problem, why exactly-once guarantee is considered better than others?

Someone can give me better definitions?

4
Take a look at this section of Kafka documentation and let us know if that clarifies your doubts.Luciano Afranllie

4 Answers

55
votes

Below definitions are quoted from Akka Documentation

at-most-once delivery

means that for each message handed to the mechanism, that message is delivered zero or one times; in more casual terms it means that messages may be lost.

at-least-once delivery

means that for each message handed to the mechanism potentially multiple attempts are made at delivering it, such that at least one succeeds; again, in more casual terms this means that messages may be duplicated but not lost.

exactly-once delivery

means that for each message handed to the mechanism exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.

The first one is the cheapest—highest performance, least implementation overhead—because it can be done in a fire-and-forget fashion without keeping state at the sending end or in the transport mechanism. The second one requires retries to counter transport losses, which means keeping state at the sending end and having an acknowledgement mechanism at the receiving end. The third is most expensive—and has consequently worst performance—because in addition to the second it requires state to be kept at the receiving end in order to filter out duplicate deliveries

5
votes

Here is an aggressive article worth reading.

I will try to answer your questions:

  • Exact-once is not fault tolerant in large distributed systems, because it is impossible for all systems to agree on each message if some of systems may fail. You can implement exact once, but it will be on top of at-least-once with your own costly coordination. Think about how TCP ensures reliable data transfer when the underlying IP protocol is not reliable.
  • By implementing exact-once on top of at-least-once, you will have duplicates (if not exact one) in case of failures and what you need is to de-duplicate.
  • Exact-once is not considered better because it comes with high cost, whereas at-least-once is good enough in most circumstances.
3
votes

Flink uses these terms to talk about the effects that events have on application state. Suppose I'm trying to count posts to stackoverflow with the tag apache-flink in daily windows. If I am working with exactly once guarantees, then each post will be counted exactly once and my analytics will be 100% correct, even if there's a failure along the way and some data has to be reprocessed to make that happen. Flink accomplishes this with a combination of globally consistent snapshots and stream replay. With at least once, then if there's a failure some posts may be counted twice, but I'm guaranteed that every post will be analyzed by the pipeline. And with at most once there will be no snapshotting and no replay in the event of a failure, which will lead to undercounting posts if something goes wrong.

Exactly-once is optimal in terms of correctness and fault tolerance, but comes at the expense of a bit of added latency.

For a much more in-depth treatment of this subject, see this blog post from data Artisans -- High-throughput, low-latency, and exactly-once stream processing with Apache Flink™ -- and the documentation of Flink's internals.

3
votes

I've found a great website where all (or most) Cloud Computing Patterns are succinctly discussed. I really recommend it to you, take a look: http://www.cloudcomputingpatterns.org

Exactly-once Delivery

For many critical systems duplicate messages are inacceptable. The messaging system ensures that each message is delivered exactly once by filtering possible message duplicates automatically.

At-least-once Delivery

In case of failures that lead to message loss or take too long to recover from, messages are retransmitted to assure they are delivered at least once.