When consuming data from Kafka topics, both Flink and Spark Streaming provide a checkpointing mechanism, provided that `enable.auto.commit` is set to false. The Spark docs say:
Spark output operations are at-least-once. So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output or store offsets in an atomic transaction alongside output.
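To make the "store offsets in an atomic transaction alongside output" option concrete, here is a minimal sketch of the pattern in plain Python (not Spark API code), using an in-memory SQLite database as the hypothetical output store. The point is that because output rows and the committed offset land in the same transaction, a batch that gets replayed after a failure can be detected and skipped, turning at-least-once delivery into effectively-exactly-once output:

```python
# Generic sketch (not Spark API): commit output and Kafka offsets in ONE
# atomic transaction, so a replayed batch is detected and skipped.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (part INTEGER, off INTEGER, value TEXT)")
db.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_off INTEGER)")

def process_batch(part, start_offset, records):
    """Write the batch's output and its new end offset atomically."""
    row = db.execute("SELECT next_off FROM offsets WHERE part = ?",
                     (part,)).fetchone()
    committed = row[0] if row else 0
    if start_offset < committed:
        # This batch was already committed before a crash: replay is a no-op.
        return "skipped (already committed)"
    with db:  # single atomic transaction: output rows + offset together
        for i, rec in enumerate(records):
            db.execute("INSERT INTO results VALUES (?, ?, ?)",
                       (part, start_offset + i, rec))
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                   (part, start_offset + len(records)))
    return "committed"

print(process_batch(0, 0, ["a", "b"]))  # committed
print(process_batch(0, 0, ["a", "b"]))  # skipped (already committed)
```

The second call simulates Spark re-delivering the same batch after a restart: the offset table shows it was already processed, so no duplicate output is written.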
But Flink docs say:
With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations, in a consistent manner. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that were stored in the checkpoint.
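The key phrase in that quote is "together with the state of other operations, in a consistent manner": offsets and operator state are snapshotted as one unit, so restoring a checkpoint rewinds both at once. A toy simulation of that idea in plain Python (not Flink code; `running_sum` is a made-up operator state for illustration):

```python
# Toy simulation (not Flink APIs) of checkpointing Kafka offsets TOGETHER
# with operator state, then restoring both after a failure. Records after
# the last checkpoint are re-consumed, but since the state was rewound with
# the offset, the final state matches a failure-free run.
records = [1, 2, 3, 4, 5, 6]               # pretend Kafka partition
state = {"offset": 0, "running_sum": 0}    # consumer offset + operator state
checkpoint = dict(state)                   # last consistent snapshot

def consume(fail_at=None):
    global state, checkpoint
    while state["offset"] < len(records):
        if fail_at is not None and state["offset"] == fail_at:
            state = dict(checkpoint)       # crash: restore state AND offset
            fail_at = None
            continue
        state["running_sum"] += records[state["offset"]]
        state["offset"] += 1
        if state["offset"] % 2 == 0:       # periodic checkpoint
            checkpoint = dict(state)

consume(fail_at=3)                         # inject a failure mid-stream
print(state["running_sum"])                # 21, same as a run with no failure
```

Note that one record is read from "Kafka" twice after the restore, yet the result is still exact: the duplication is absorbed because the partially-applied state was discarded along with the uncommitted offset. That is the difference from checkpointing offsets alone.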
From reading other sources, my understanding is that Flink's checkpointing saves the state of the program as well as the consumer offsets, whereas Spark's checkpointing saves only the consumer offsets, which is why the Spark docs say:
Spark output operations are at-least-once.
Can anyone explain what the differences are, and how one can achieve exactly-once semantics when reading data from Kafka topics?