I am reading the documentation about flink exactly-once feature here. And I do not quite understand some of the sentences:
After a successful pre-commit, the commit must be guaranteed to eventually succeed – both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
This says data loss occurs if the commit does not eventually succeed. I interpret it as: The commit could succeed but it just happen to keep failing for every restart because of certain reason. In this case, Flink can only give up on the data belonging to this commit. So, if data loss is unacceptable, the application should be restarted until the commit succeeds ?
As we know, if there’s any failure, Flink restores the state of the application to the latest successful checkpoint. One potential catch is in a rare case when the failure occurs after a successful pre-commit but before notification of that fact (a commit) reaches our operator. In that case, Flink restores our operator to the state that has already been pre-committed but not yet committed.
I do not quite follow here, either. What's this notification about, which is not mentioned above ? And does the said operator mean the sink operator ? Also, as I interpret it, if the commit has succeeded and only the so-called notification fails, would it cause data duplication after restoration to the pre-commited state ?
Please correct me if the question itself is not valid. Any help is appreciated.