1 vote

I am a newbie to Flink and I am trying to write a simple streaming job with exactly-once semantics that reads from Kafka and writes the data to S3. By "exactly once" I mean that I don't want to end up with duplicates if there is an intermediate failure between the write to S3 and the commit of the file sink operator. I am using Kafka v2.5.0, and according to the connector described in this page, I am guessing my use case will end up with exactly-once behavior.

Questions:

1) Is my assumption correct that my use case will end up with exactly-once semantics even if a failure occurs in any part of these steps, so that I can say my S3 files won't have duplicate records?

2) How does Flink handle exactly-once with S3? The documentation says it uses multipart upload to get exactly-once semantics, but how is this achieved internally? Say the task fails after the S3 multipart upload succeeds but before the operator commits; in that case, once the operator restarts, will it stream the already-written data to S3 again and thereby create duplicates?


1 Answer

1 vote

If you read from Kafka and then write to S3 with the StreamingFileSink, you should indeed be able to get exactly-once.
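
A minimal sketch of what such a job could look like, assuming Flink 1.10+ with the Kafka connector and the flink-s3-fs-hadoop filesystem on the classpath (the broker address, topic, group id, bucket path, and checkpoint interval below are placeholders):

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    import java.util.Properties;

    public class KafkaToS3ExactlyOnce {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing is mandatory: pending part files are only committed (made
            // visible in S3) when the checkpoint that covers them completes.
            env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder broker
            props.setProperty("group.id", "s3-writer");           // placeholder group id

            FlinkKafkaConsumer<String> source =
                    new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);

            StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path("s3a://my-bucket/output"), // placeholder bucket/path
                            new SimpleStringEncoder<String>("UTF-8"))
                    // Roll part files on every checkpoint so that only checkpointed data
                    // ever becomes a finished, visible S3 object.
                    .withRollingPolicy(OnCheckpointRollingPolicy.build())
                    .build();

            env.addSource(source).addSink(sink);
            env.execute("kafka-to-s3-exactly-once");
        }
    }

The important parts are that checkpointing is enabled (without it, pending files are never committed) and that the output path uses the Hadoop-based S3 filesystem, since the StreamingFileSink relies on its recoverable multipart-upload writer.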

Though it is not specifically about S3, this article gives a nice explanation of how to ensure exactly-once in general.

https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html

My key takeaway: After a failure we must always be able to see where we stand from the perspective of the sink.
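
To tie that back to question 2, here is a hypothetical, heavily simplified sketch of the protocol the sink follows with S3 multipart uploads (the class and method names are illustrative only, not Flink or AWS APIs): data is staged in an uncommitted multipart upload, the upload's state is stored in the checkpoint, and the upload is only completed after the checkpoint succeeds.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only; these are NOT Flink's actual classes or the AWS SDK.
    public class CommitOnCheckpointSink {

        // Parts written to an in-progress multipart upload; not visible as an S3 object yet.
        private final List<String> inProgress = new ArrayList<>();

        // Uploads covered by a checkpoint but whose "complete multipart upload" call
        // has not been issued yet.
        private final List<String> pending = new ArrayList<>();

        // Phase 1: records go into the open multipart upload; nothing is readable in S3.
        void write(String record) {
            inProgress.add(record);
        }

        // On checkpoint: what has been written so far is recorded in Flink state
        // (in the real sink: the upload id and part ETags) and moved to "pending".
        List<String> snapshotState() {
            pending.addAll(inProgress);
            inProgress.clear();
            return new ArrayList<>(pending);
        }

        // Phase 2: only after the checkpoint has completed is the multipart upload
        // finished, which is the moment the file becomes visible in S3.
        void notifyCheckpointComplete() {
            pending.clear();
        }

        // On restart after a failure: pending uploads are restored from the last
        // completed checkpoint and their commit is retried; anything written after
        // that checkpoint is discarded and replayed from the checkpointed Kafka
        // offsets, so the output contains each record exactly once.
        void restoreState(List<String> restoredPending) {
            inProgress.clear();
            pending.clear();
            pending.addAll(restoredPending);
            notifyCheckpointComplete();
        }
    }

So a failure between the multipart upload and the commit should not lead to duplicates: the restarted job resumes from the last completed checkpoint, which records both the Kafka offsets and the state of every in-progress or pending file, and finishes (or discards) those uploads instead of re-writing the data.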