About Flink exactly-once feature

Question

I am reading the documentation about Flink's exactly-once feature, and I do not quite understand some of the sentences:

After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.

This says data loss occurs if the commit does not eventually succeed. I interpret it as: the commit could succeed, but it just happens to keep failing on every restart for some reason. In this case, Flink can only give up on the data belonging to this commit. So, if data loss is unacceptable, should the application be restarted until the commit succeeds?

As we know, if there’s any failure, Flink restores the state of the application to the latest successful checkpoint. One potential catch is in a rare case when the failure occurs after a successful pre-commit but before notification of that fact (a commit) reaches our operator. In that case, Flink restores our operator to the state that has already been pre-committed but not yet committed.

I do not quite follow here, either. What is this notification, which is not mentioned above? And does "our operator" mean the sink operator? Also, as I interpret it, if the commit has succeeded and only the so-called notification fails, would that cause data duplication after restoring to the pre-committed state?

Please correct me if the question itself is not valid. Any help is appreciated.

Fabian Hueske · Accepted Answer · 2019-02-03T09:08:22

Flink's end-to-end exactly-once mechanism is based on a protocol similar to two-phase commit (2PC). The protocol ensures that either all sinks of a program commit their output to an external system or none of them do.
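To make the all-or-nothing rule concrete, here is a minimal toy sketch in Python. It is not Flink code: `ToySink` and `run_checkpoint` are hypothetical names invented for illustration, and the real coordinator is part of Flink's checkpointing. It only shows that commits are issued after every sink has voted "ready", and that a single "no" vote aborts all of them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySink:
    """Hypothetical sink used for illustration; not a Flink class."""
    name: str
    can_pre_commit: bool = True
    staged: List[str] = field(default_factory=list)
    committed: List[str] = field(default_factory=list)

    def pre_commit(self) -> bool:
        # Phase 1: vote. Returning True is a promise that commit() will work.
        return self.can_pre_commit

    def commit(self) -> None:
        # Phase 2: make the staged output visible in the external system.
        self.committed.extend(self.staged)
        self.staged.clear()

    def abort(self) -> None:
        # Discard the staged output; nothing becomes visible.
        self.staged.clear()


def run_checkpoint(sinks: List[ToySink]) -> bool:
    """Commit on all sinks only if every sink pre-committed; otherwise abort all."""
    votes = [sink.pre_commit() for sink in sinks]
    if all(votes):
        for sink in sinks:
            sink.commit()
        return True
    for sink in sinks:
        sink.abort()
    return False


# Both sinks are ready, so both commit.
a = ToySink("a", staged=["x"])
b = ToySink("b", staged=["y"])
ok = run_checkpoint([a, b])

# One sink is not ready, so neither commits.
c = ToySink("c", staged=["x"])
d = ToySink("d", can_pre_commit=False, staged=["y"])
failed = run_checkpoint([c, d])
```

In the second run, sink `c` was ready but still commits nothing, because `d` could not pre-commit.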

When a sink task says "I am ready to commit" (pre-commit), it guarantees that it is able to perform the commit. The sink task then waits for a commit notification from the coordinator, which is sent only if all sink tasks have agreed that they are ready to commit. The guarantee must also hold if the application fails before the notification is received. In that case, the sink task must be able to recover the open (not-yet-committed) transaction and execute it when the next notification is received. If there are multiple failures, the sink must keep trying until the commit succeeds. However, the transaction must be applied only once, no matter how many failures occur.
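The recovery behaviour described above can be sketched as follows. Again, this is a toy simulation rather than Flink's API: `ToyExternalSystem`, `recover_and_commit`, and the transaction id `"txn-1"` are assumptions made up for the example. The two points it tries to show are that a pre-committed transaction survives failures and is retried until it commits, and that committing the same transaction again has no effect, which is what prevents duplicates when a commit succeeded but the sink did not learn about it.

```python
class ToyExternalSystem:
    """Hypothetical transactional store standing in for the external system."""

    def __init__(self, failures_before_success=0):
        self.pending = {}           # txn_id -> pre-committed records
        self.visible = []           # committed output
        self.committed_ids = set()  # ids of transactions already applied
        self._failures_left = failures_before_success

    def pre_commit(self, txn_id, records):
        # Durably stage the records so they survive a crash of the sink.
        self.pending[txn_id] = list(records)

    def commit(self, txn_id):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("transient failure")
        if txn_id in self.committed_ids:
            return  # already applied: committing again is a no-op
        self.visible.extend(self.pending.pop(txn_id))
        self.committed_ids.add(txn_id)


def recover_and_commit(store, txn_id, max_attempts=10):
    """Retry a recovered transaction until it commits; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            store.commit(txn_id)
            return attempt
        except ConnectionError:
            continue  # in a real job, each failure would trigger a restart
    raise RuntimeError("commit never succeeded: pre-committed data would be lost")


store = ToyExternalSystem(failures_before_success=2)
store.pre_commit("txn-1", ["a", "b"])
attempts = recover_and_commit(store, "txn-1")

# A repeated commit after another recovery must not write the data twice.
recover_and_commit(store, "txn-1")
```

The `max_attempts` bound exists only so the sketch terminates. Giving up corresponds to the data-loss case: the records were pre-committed but never became visible.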

That's what is meant by

After a successful pre-commit, the commit must be guaranteed to eventually succeed

If a sink task is not able to eventually commit data that it pre-committed, the data is lost.
