Mesos cluster does not recover when physical hosts restart

Question

I'm running Mesosphere on 3 hosts on Ubuntu 14.04, as follows:

one with the Mesos master
two with Mesos slaves

Everything works fine, but after restarting all the physical hosts, all scheduled jobs were lost. Is this normal? I expected ZooKeeper to store the current jobs, so that when the system needs a restart, all jobs are rescheduled once the master boots.

Update: I'm running Marathon and Mesos on the same node, and I run Marathon with the --zk flag.

Adam Adam · Accepted Answer · 2015-02-09T22:06:42

With Marathon's --zk and --ha enabled, Marathon should store its state in ZooKeeper and recover it on restart, as long as Mesos allows it to reregister with the same framework ID.

However, you'll also need to enable the Mesos registry (even for a single master) so that Mesos persists information about which framework IDs are registered across a master failover. You can do this by setting --registry=replicated_log (the default), --quorum=1 (since you only have one master), and --work_dir=/path/to/registry (where the state is stored).
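As a rough sketch, the flags above could be combined as follows. The ZooKeeper address, the /var/lib/mesos work directory, and the bare mesos-master and marathon command names are assumptions for illustration, not values from the question. The script only prints the two commands so they can be reviewed before use.

```shell
#!/bin/sh
# Sketch only: every value below is an assumption, not taken from the question.
ZK="zk://localhost:2181"      # assumed ZooKeeper address
WORK_DIR="/var/lib/mesos"     # assumed registry location; must survive reboots

# Mesos master: persist the registry so framework IDs survive a master restart.
MASTER_CMD="mesos-master --zk=${ZK}/mesos --registry=replicated_log --quorum=1 --work_dir=${WORK_DIR}"

# Marathon: find the master through ZooKeeper and keep its own state there too.
MARATHON_CMD="marathon --master ${ZK}/mesos --zk ${ZK}/marathon --ha"

# Print the commands for review instead of running them.
echo "$MASTER_CMD"
echo "$MARATHON_CMD"
```

If you installed from the Mesosphere packages, these options are typically set through the package's configuration files rather than passed by hand, so treat the command lines as a description of the target configuration. Whichever way you set them, the work directory should be on storage that is not wiped on reboot.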
