Why does Marathon not terminate jobs after the quorum is lost?

Question

I'm working with Apache Mesos and Marathon. I have 3 master nodes and 3 slave nodes, and I configured Mesos with a quorum of 2. I then posted a JSON to Marathon to run one job, and everything looked fine.

Next I shut down two of the master nodes to break the quorum. After this, Mesos unregistered all the slaves, as expected, but when I inspected the slaves I found that the job I had started was still running. Is this normal? I assumed Marathon would stop all jobs once the quorum was lost.

Adam Adam · Accepted Answer · 2015-02-12T11:52:23

Part of the Mesos philosophy, especially for long-running services, is that a failure in one or more Mesos components should not need to stop the user application.

If a slave shuts down and the framework has checkpointing enabled, the executor driver will wait for the slave's --recovery_timeout (default 15 minutes) before shutting down the executor/tasks. If you do not want the tasks to outlive the slave in this way, disable checkpointing on your framework (in Marathon, just set --checkpoint=false when starting Marathon). See also Marathon's --failover_timeout on https://mesosphere.github.io/marathon/docs/command-line-flags.html
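As a rough illustration of the rule described above, here is a small Python sketch. It is a toy model, not Mesos code: the function name and parameters are made up for this example, and the only number taken from the answer is the 15-minute default for --recovery_timeout. It also assumes that without checkpointing the executor does not wait at all.

```python
# Toy model of the behaviour described above; not the Mesos implementation.
# All names here are hypothetical and exist only for illustration.

RECOVERY_TIMEOUT_SECS = 15 * 60  # the 15-minute default quoted in the answer


def executor_shuts_down(secs_since_slave_exit, checkpointing,
                        recovery_timeout=RECOVERY_TIMEOUT_SECS):
    """Return True if the executor (and its tasks) would be shut down."""
    if not checkpointing:
        # Assumption: with checkpointing disabled there is no waiting period.
        return True
    # With checkpointing enabled, the executor waits for the slave to come
    # back until the recovery timeout has elapsed.
    return secs_since_slave_exit >= recovery_timeout
```

For example, `executor_shuts_down(60, checkpointing=True)` is `False` in this model, because one minute is still well inside the recovery window.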

On the other hand, if it's just the Masters or ZooKeeper (ZK) nodes that shut down, and the Slaves are still up and running, the slaves can still monitor the tasks and queue up status updates, so the tasks can stay alive. If ZK loses quorum, then there is no leading master, and each slave will continue to operate independently until a new leader is detected, at which point it will reregister with the master and send any queued status updates.
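The queue-then-flush behaviour described above can be sketched in a few lines of Python. This is only a toy model to make the idea concrete: the class and method names are hypothetical, the "master" is just a plain list that receives updates, and none of it reflects the actual Mesos slave code.

```python
from collections import deque


class ToySlave:
    """Toy model only; names are hypothetical, not the Mesos slave API."""

    def __init__(self):
        self.leader = None      # no leading master known yet
        self.pending = deque()  # status updates not yet delivered
        self.tasks = {}         # task_id -> last known state

    def task_changed(self, task_id, state):
        # The task keeps running locally whether or not a master is reachable.
        self.tasks[task_id] = state
        self.pending.append((task_id, state))
        self.flush()

    def leader_detected(self, master):
        # Called when a new leading master appears (None means no leader).
        self.leader = master
        self.flush()

    def flush(self):
        # Deliver queued updates in order, but only if a leader is known.
        if self.leader is None:
            return
        while self.pending:
            self.leader.append(self.pending.popleft())
```

In this model, calling `task_changed("job-1", "TASK_RUNNING")` while `leader` is `None` leaves the task recorded in `tasks` and the update waiting in `pending`; a later `leader_detected(master)` call delivers it.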
