Restore Apache Flink job from checkpoint

Question

I'm using an Apache Flink + RabbitMQ stack. I know that savepoints can be triggered manually and that jobs can be restored from them. The problem is that Flink acknowledges messages after each successful checkpoint, so if you take a savepoint and later restore from it, you lose all data between the last successful savepoint and the last successful checkpoint. Is there a way to restore a job from a checkpoint? That would solve the problem of losing data with non-replayable data sources (like RabbitMQ). By the way, if we already pay the overhead of checkpoints, why not let users use them?

Fabian Hueske · Accepted Answer · 2016-09-14T21:31:25

Conceptually, a savepoint is nothing more than a checkpoint plus a bit of metadata. In both cases (savepoint and checkpoint), Flink creates a consistent snapshot of the state of all operators, sources, and sinks.

Checkpoints are considered to be an internal mechanism for failure recovery. However, checkpoints can be configured to be externalized checkpoints. Externalized checkpoints are not automatically cleaned up when a job terminates and can be used to manually restart a program.
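As a rough sketch of what enabling externalized checkpoints can look like in a job's setup code: this assumes a Flink 1.x release in which `CheckpointConfig.enableExternalizedCheckpoints` is available (newer releases deprecate or rename it), and the class name and the 10-second interval are purely illustrative. Depending on the version, a target directory for the checkpoint metadata (for example via `state.checkpoints.dir` in the Flink configuration) may also have to be set.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the two configuration calls matter here.
class ExternalizedCheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 seconds (illustrative interval).
        env.enableCheckpointing(10_000L);

        // Keep the latest checkpoint when the job is cancelled, so that it
        // can be used to restart the program manually (assumes Flink 1.x API).
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Define sources, operators, and sinks here, then call env.execute(...).
    }
}
```

A retained checkpoint can then typically be passed to the CLI the same way a savepoint is, for example with `bin/flink run -s` followed by the checkpoint path; check the documentation of your Flink version for the exact form.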

Your problem with the RabbitMQ source is that it somewhat violates Flink's checkpointing semantics: by acking messages on checkpoint, it pushes state to an external system, and that state cannot be reset.
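To make the failure mode concrete, here is a toy model in plain Java. It does not use Flink or RabbitMQ at all; `Broker`, `ackUpTo`, and `redeliver` are hypothetical names invented for this illustration. It only mimics the sequence "savepoint, then a later checkpoint acks more messages, then restore from the savepoint".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class AckOnCheckpointDemo {

    /** Toy broker: once a message is acked it is gone and cannot be replayed. */
    static final class Broker {
        private final Deque<Integer> unacked = new ArrayDeque<>();

        void publish(int msg) {
            unacked.addLast(msg);
        }

        /** Drop every message with an id less than or equal to msg. */
        void ackUpTo(int msg) {
            while (!unacked.isEmpty() && unacked.peekFirst() <= msg) {
                unacked.removeFirst();
            }
        }

        /** Messages the broker can still deliver after a restart. */
        List<Integer> redeliver() {
            return new ArrayList<>(unacked);
        }
    }

    static List<Integer> run() {
        Broker broker = new Broker();
        for (int m = 1; m <= 6; m++) {
            broker.publish(m);
        }

        List<Integer> state = new ArrayList<>();

        // The job consumes 1 and 2, then a savepoint is taken.
        // A savepoint is a checkpoint, so messages 1 and 2 are acked.
        state.add(1);
        state.add(2);
        List<Integer> savepoint = new ArrayList<>(state);
        broker.ackUpTo(2);

        // The job keeps running: it consumes 3 and 4, and a regular
        // checkpoint acks them. The broker forgets 3 and 4 for good.
        state.add(3);
        state.add(4);
        broker.ackUpTo(4);

        // The job is stopped and restored from the older savepoint.
        List<Integer> restored = new ArrayList<>(savepoint);
        restored.addAll(broker.redeliver());
        return restored;
    }

    public static void main(String[] args) {
        // Messages 3 and 4 are in neither the savepoint nor the broker.
        System.out.println(run());
    }
}
```

In this model, messages 3 and 4 are part of the state captured by the later checkpoint only, so restoring from the savepoint loses them. Restoring from the latest checkpoint instead would not.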

Would a mechanism that triggers a savepoint and immediately shuts down the job afterwards solve your problem? This would prevent a checkpoint from being triggered after the savepoint was taken.
