
When consuming data from Kafka topics, both Flink and Spark Streaming provide a checkpointing mechanism, provided that enable.auto.commit is set to false. The Spark docs say:

Spark output operations are at-least-once. So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output or store offsets in an atomic transaction alongside output.
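The second approach mentioned there can be sketched in plain Python, using SQLite in place of a real sink; the table names and the commit_batch helper are made up for illustration, not Spark API:

```python
# Sketch: store Kafka offsets in the same atomic transaction as the output.
# If the process crashes before COMMIT, neither the results nor the offset
# are persisted, so the batch is re-read and re-processed on restart.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("CREATE TABLE offsets (topic_partition TEXT PRIMARY KEY,"
             " next_offset INTEGER)")

def commit_batch(conn, records, topic_partition, next_offset):
    # One atomic transaction: output and offset succeed or fail together.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO results (k, v) VALUES (?, ?)", records)
        conn.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
            (topic_partition, next_offset))

commit_batch(conn, [("user-1", "click"), ("user-2", "view")], "events-0", 2)
```

On recovery you read the stored offset back and resume consuming from there; a replayed batch simply overwrites the same rows and offset.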

But Flink docs say:

With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations, in a consistent manner. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that were stored in the checkpoint.

From reading other sources, I gather that Flink's checkpointing saves the state of the program as well as the consumer offsets, while Spark's checkpointing saves only the consumer offsets, which is why the Spark docs say:

Spark output operations are at-least-once.

Can anyone explain what the differences are, and how one can achieve exactly-once semantics when reading data from Kafka topics?


2 Answers


I think this covers what you are looking for: https://data-artisans.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink

The big difference between exactly-once and at-least-once is that with exactly-once you are guaranteed no duplicate data is output. At-least-once guarantees that you won't lose any data (as does exactly-once), but duplicate data may be output.
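The duplicate-on-recovery behavior can be shown with a toy simulation (no real Kafka involved): after a crash, processing restarts from the last checkpointed offset, so any record processed after that checkpoint is output a second time.

```python
# Toy simulation of at-least-once: replay from the last checkpoint
# re-outputs records that were already emitted before the crash.
records = ["a", "b", "c", "d"]
output = []

checkpointed_offset = 0
for offset, rec in enumerate(records):
    output.append(rec)
    if offset == 1:
        checkpointed_offset = offset + 1  # checkpoint taken after "b"
    if offset == 2:
        break  # simulated crash after outputting "c"

# Recovery: re-read everything from the last checkpointed offset.
for rec in records[checkpointed_offset:]:
    output.append(rec)

print(output)  # "c" appears twice: no data lost, but a duplicate emitted
```

Exactly-once systems close this gap by also restoring operator state and deduplicating (or deferring) the output, not by avoiding the replay itself.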

Edit:

I should mention I am not as familiar with Spark as I am with Flink, but this is a major topic in Flink's documentation, which is why I linked its overview. The concept of exactly-once vs. at-least-once is universal, though, and not technology-dependent.


Can anyone says what is the differences and how someone can reach exactly-once semantic in reading data from Kafka topics?

Exactly-once semantics cannot be achieved on the source side alone. Exactly-once is a property that the whole streaming application needs to support:

  • You need to ensure that all incoming data points are processed (at least once). That's what you get by storing source offsets in your checkpoints. In case of a failure, you re-read the same data and reprocess everything between the last successful checkpoint and the failure.
  • Your application code itself needs to ensure that everything it does in between is deterministic and idempotent. If you trigger an action outside of the streaming framework, that action will be re-triggered during recovery, so calculations need to be consistent across recoveries.
  • Duplicate output will be emitted during recovery. That is inherent to the system and cannot be avoided at the framework level. Hence, for exactly-once output, you have three options:
    1. Defer output until a checkpoint has been written, which is the simplest and most universal solution, but of course introduces significant latency.
    2. Use an idempotent sink. If you write to a KV store and never update any value, duplicates are implicitly filtered.
    3. Use a transactional sink, where duplicates are filtered. That's how Flink and Kafka Streams write into a Kafka topic.
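Option 2 is the easiest to demonstrate. A minimal sketch of an idempotent key-value sink (the KVSink class is made up for illustration): as long as each record deterministically maps to the same key and value, replaying it during recovery is invisible to readers.

```python
# Sketch of an idempotent KV sink: rewriting the same key with the same
# deterministic value is a no-op, so replayed records are implicitly
# deduplicated.
class KVSink:
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        # Overwrite-with-same-value leaves the store unchanged from the
        # reader's perspective, so replays are harmless.
        self.store[key] = value

sink = KVSink()
for key, value in [("e1", 10), ("e2", 20)]:
    sink.write(key, value)

# Crash + recovery replays the first two records plus a new one.
for key, value in [("e1", 10), ("e2", 20), ("e3", 30)]:
    sink.write(key, value)
```

Note this only works because values are never updated to something different; if the same key could legitimately receive two different values, you would need option 1 or 3 instead.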

To reiterate: data always needs to be read more than once in case of failures, which is why exactly-once at the source level alone does not make sense. It needs to be implemented across the whole framework/application.
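For completeness, option 1 above (defer output until a checkpoint completes) can also be sketched; the BufferedSink class is a made-up illustration of the idea, not Flink API. Results become visible only after the checkpoint that covers them is durable, so readers never see output from a replayable stretch of input:

```python
# Sketch: buffer results and publish them only once a checkpoint has
# been written. Uncheckpointed results are discarded on failure and
# recomputed from the replayed input.
class BufferedSink:
    def __init__(self):
        self.pending = []    # results since the last checkpoint
        self.published = []  # results visible to downstream readers

    def write(self, record):
        self.pending.append(record)

    def on_checkpoint_complete(self):
        # The checkpoint durably recorded the offsets that produced
        # self.pending, so these results will never be replayed.
        self.published.extend(self.pending)
        self.pending.clear()

    def on_failure(self):
        # Discard uncheckpointed results; the source replays the
        # corresponding input after recovery.
        self.pending.clear()

sink = BufferedSink()
sink.write("r1")
sink.write("r2")
sink.on_checkpoint_complete()
sink.write("r3")
sink.on_failure()  # "r3" is dropped and will be recomputed on replay
```

The price is latency: nothing is visible downstream until the next checkpoint, which is exactly the "huge lag" mentioned in option 1.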

Getting the sink right is the hardest part. My guess is that Spark does not support that yet. It might be harder for the Spark developers because of the different ways checkpoints are actually taken in each of the mentioned systems. But my Spark (Streaming) knowledge is over a year old, so there may have been developments in that regard.