
I am attempting to recover my jobs and state after my JobManager goes down, but I haven't been able to restart my jobs successfully.

From my understanding, TaskManager recovery is aided by the JobManager (this works as expected), and JobManager recovery is handled through ZooKeeper.

Is there a way to recover the JobManager without ZooKeeper?

I am using Docker for my setup, and all checkpoints & savepoints are persisted to mapped volumes.

Is Flink able to recover when all JobManagers go down? I can afford to wait for the single JobManager to restart.

When I restart the JobManager I get the following exception:

org.apache.flink.runtime.rest.NotFoundException: Job 446f4392adc32f8e7ba405a474b49e32 not found

I have set the following in my flink-conf.yaml:

state.backend: filesystem
state.checkpoints.dir: file:///opt/flink/checkpoints
state.savepoints.dir: file:///opt/flink/savepoints

I think my issue may be that the JAR gets deleted when the JobManager is restarted, but I am not sure how to solve this.


2 Answers

1 vote

At the moment, Flink only supports recovering from a JobManager failure if you are using ZooKeeper. However, in theory you can also make it work without ZooKeeper if you can guarantee that only a single JobManager is ever running. See this answer for more information.
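
For reference, a minimal sketch of the ZooKeeper-based HA settings that would go into flink-conf.yaml; the ZooKeeper address (zookeeper:2181), the storage path, and the cluster id below are placeholders for your environment:

high-availability: zookeeper
# ZooKeeper ensemble used to store leader information and pointers to job metadata
high-availability.zookeeper.quorum: zookeeper:2181
# Shared directory (e.g. a mapped volume) where job graphs and submitted JARs are persisted
high-availability.storageDir: file:///opt/flink/ha
# Optional: isolates this cluster if several Flink clusters share the same ZooKeeper ensemble
high-availability.cluster-id: /my-flink-cluster

With HA enabled, the JobManager persists the job graph and the uploaded JAR under high-availability.storageDir, which is what lets a restarted JobManager find and resubmit the job instead of answering "Job ... not found".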

0 votes

You can check out running your cluster as a "Flink Job Cluster". This will automatically start the job that you baked into the Docker image when the container comes up. You can read more here.
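
As a rough sketch of that approach (the image name my-flink-job, the JAR my-job.jar, and the entry class com.example.MyJob are all placeholders), you bake the JAR into the official Flink image and start the JobManager container in job mode:

# Dockerfile: bake the job JAR into the image so it survives container restarts
FROM flink:1.11
COPY target/my-job.jar /opt/flink/usrlib/my-job.jar

# Start the JobManager in job mode; it submits the baked-in job on startup
# (Docker networking and jobmanager.rpc.address configuration omitted for brevity)
docker run --name jobmanager my-flink-job standalone-job --job-classname com.example.MyJob

# Start at least one TaskManager from the same image
docker run --name taskmanager my-flink-job taskmanager

The standalone-job entrypoint also accepts a --fromSavepoint /path/to/savepoint argument if you want the restarted job to resume from a specific savepoint. Because the job ships inside the image, restarting the JobManager container re-creates the job as well, rather than depending on a previously uploaded JAR still being present.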