Flink resume from externalised checkpoint question

Question

I am running Flink inside ECS, installed from docker-flink. I have enabled externalized checkpoints to AWS S3 by pointing state.checkpoints.dir at an S3 location in flink-conf.yaml.

According to the Flink documentation, to resume from a checkpoint after a failure we have to run bin/flink run -s :checkpointMetaDataPath [:runArgs]. However, I start the job with FLINK_HOME/bin/standalone-job.sh start-foreground, so I cannot figure out how my Flink job would resume from an externalized checkpoint after a failure.

Do we really need a config option for resuming from a checkpoint? Can't the JobManager (JM), as part of the restart strategy, automatically read the last offsets from the state store? I am new to Flink.


Till Rohrmann · Accepted Answer · 2020-04-03T11:57:55

The referenced Dockerfile alone won't start a Flink job. It only starts a Flink session cluster, which is able to execute Flink jobs. The next step is to submit a job with bin/flink run. Once a job that has checkpointing enabled via StreamExecutionEnvironment.enableCheckpointing is submitted and running, it writes checkpoints to the configured location.

If you have enabled retention of checkpoints, you can cancel the job and resume it from a checkpoint via bin/flink run -s ....
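
The cancel-and-resume cycle described above could look roughly like the sketch below. It only prints the two commands rather than running them, so it can be tried without a cluster. The function name resume_commands and all three arguments (job ID, checkpoint path, jar name) are hypothetical placeholders, not values from the question.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands for a cancel-and-resume cycle.
# resume_commands and its arguments are illustrative placeholders.

resume_commands() {
  local job_id="$1" checkpoint_path="$2" job_jar="$3"
  # Cancel the running job. The checkpoint survives the cancel only if
  # checkpoint retention is enabled for the job.
  printf 'bin/flink cancel %s\n' "$job_id"
  # Submit the job again, restoring state from the retained checkpoint.
  printf 'bin/flink run -s %s %s\n' "$checkpoint_path" "$job_jar"
}
```

Calling resume_commands with a job ID, the path of a retained checkpoint, and the job jar prints the two commands in order, so they can be reviewed before being run against the session cluster.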

Job cluster

If you are running a per-job cluster where the image already contains the user code jars, you can resume from a savepoint by starting the image with --fromSavepoint <SAVEPOINT_PATH> as a command line argument. Note that <SAVEPOINT_PATH> needs to be accessible from the container running the job manager.

Update

To resume from a checkpoint when using standalone-job.sh, call:

FLINK_HOME/bin/standalone-job.sh start-foreground --fromSavepoint <SAVEPOINT/CHECKPOINT_PATH>
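
Because a restarted container runs the same command again, one possible approach is a small entrypoint wrapper that looks up the newest completed checkpoint and passes it to --fromSavepoint. The sketch below is an illustration, not part of Flink. It assumes that the job's checkpoint directory (the one containing the chk-<n> subdirectories) is visible on a local or mounted filesystem. For S3 the listing would have to come from an S3 client instead. The function names latest_checkpoint and build_start_command are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: choose the newest completed checkpoint and build the start command.
# Assumption: "$1" is a locally visible directory holding chk-<n> subdirectories.

latest_checkpoint() {
  local job_dir="$1" best="" best_n=-1 dir n
  for dir in "$job_dir"/chk-*; do
    # Only checkpoints with a _metadata file are complete and restorable.
    [ -f "$dir/_metadata" ] || continue
    n="${dir##*/chk-}"
    # Skip anything whose suffix is not a plain number.
    case "$n" in ''|*[!0-9]*) continue ;; esac
    if [ "$n" -gt "$best_n" ]; then
      best_n="$n"
      best="$dir"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

build_start_command() {
  local ckpt
  ckpt="$(latest_checkpoint "$1")" || true
  if [ -n "$ckpt" ]; then
    # A completed checkpoint exists: restore from it.
    printf '%s\n' "standalone-job.sh start-foreground --fromSavepoint $ckpt"
  else
    # First start or nothing retained: start without restoring state.
    printf '%s\n' "standalone-job.sh start-foreground"
  fi
}
```

The checkpoint numbers are compared numerically so that chk-12 is preferred over chk-3, and directories without _metadata are skipped because they may belong to a checkpoint that never completed. A real entrypoint would exec the resulting command from FLINK_HOME/bin rather than print it.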
