
If we have 2 task managers, each running in a different JVM (as is always the case), and an operator in the middle of the data flow fails, either through an exception or because its JVM terminates, can we assume that the entire data flow, including all sources and operators across all task managers participating in that job/data flow, will fail and restart (if restarts are enabled)? From reading your docs I understand the answer is yes, but I would like to make sure.

JVM 1 /
Task manager 1
    source1 (1) --> operator1 (1) --> ...
                |
JVM 2 /         |
Task manager 2  |
                |
                +-> operator1 (2) --> ...

So suppose operator1 (2) fails, or its JVM fails: will the source1 instance and both operator1 instances fail and restart?

Please post your questions separately and give them a meaningful title. That is important because others should be able to find relevant questions and answers as well. - Fabian Hueske
OK - I left here one and will publish the other 2 later. Thanks! - Shay
Thank you. I'll have a look at them! - Fabian Hueske

1 Answer


Yes, that is correct. In the current version (Flink 1.5.0), a job is recovered by canceling all of its tasks and restarting them.

However, this might change in the future to speed up the recovery cycle. If that happens, tasks would be paused, reload their state from the last successful checkpoint, and continue when the failed task(s) have been recovered.
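As a sketch of the "if restarts are enabled" part: in Flink 1.5 a restart strategy can be configured cluster-wide in `flink-conf.yaml`, for example with the fixed-delay strategy (the attempt count and delay below are illustrative values, not recommendations):

```yaml
# flink-conf.yaml -- restart strategy applied to jobs by default
restart-strategy: fixed-delay
# give up after 3 restart attempts
restart-strategy.fixed-delay.attempts: 3
# wait 10 seconds between attempts
restart-strategy.fixed-delay.delay: 10 s
```

The same can also be set per job on the `StreamExecutionEnvironment` via `env.setRestartStrategy(RestartStrategies.fixedDelayRestart(...))`, which overrides the cluster default. Without any restart strategy (and with checkpointing disabled), a failed job simply transitions to FAILED instead of restarting.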