
I'm trying to achieve HA with three machines, running masters and slaves as laid out below. I'm using VMs for a local test setup; my observations follow.

Case 1:

m1 -> leader master

m2 -> non-leader master, slave1

m3 -> non-leader master, slave2

  • Case 1.1: When I power off VM m1, one of the non-leading masters becomes the leader and the cluster remains accessible; everything works properly.

  • Case 1.2: When I power off m2 or m3 (either VM running a non-leading master and a slave), I see the message 'No master is currently leading' on the web page when I try to access Mesos on m1 or on whichever of m2/m3 is still available.

Case 2:

m1 -> non-leader master

m2 -> leader master, slave1

m3 -> non-leader master, slave2

  • Case 2.1: When I power off VM m1, the leader on m2 stays in place and the cluster works properly.

  • Case 2.2: When I power off m2 (the leader co-located with a slave), the cluster becomes unavailable with the error message 'No master is currently leading' on the web page.

  • Case 2.3: When I power off m3 (a non-leading master co-located with a slave), the cluster becomes unavailable with the same 'No master is currently leading' error on the web page.

Apologies for attempting HA with only three machines, and for the lengthy problem description.

Questions:

  • Does killing a machine that runs both a master (leading or non-leading) and a slave always lead to cluster unavailability? (Cases 1.2, 2.2, 2.3)

  • Can we achieve HA with three machines as above, i.e. 3 masters and 2 slaves, with masters and slaves co-located on the same machines?

The configuration follows.

Masters:

m1 : mesos-master --ip=192.168.1.36 --hostname=192.168.1.36 --port=6060 --quorum=2 --cluster=mesosCluster --zk=zk://192.168.1.36:2181,192.168.1.42:2181,192.168.1.45:2181/mesos --work_dir=/opt/ncms/mesosWorkDir/ --log_dir=/opt/ncms/mesosWorkDir/logs

m2 : mesos-master --ip=192.168.1.42 --hostname=192.168.1.42 --port=6060 --quorum=2 --cluster=mesosCluster --zk=zk://192.168.1.36:2181,192.168.1.42:2181,192.168.1.45:2181/mesos --work_dir=/opt/ncms/mesosWorkDir/ --log_dir=/opt/ncms/mesosWorkDir/logs

m3 : mesos-master --ip=192.168.1.45 --hostname=192.168.1.45 --port=6060 --quorum=2 --cluster=mesosCluster --zk=zk://192.168.1.36:2181,192.168.1.42:2181,192.168.1.45:2181/mesos --work_dir=/opt/ncms/mesosWorkDir/ --log_dir=/opt/ncms/mesosWorkDir/logs
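To confirm which of the three masters is currently leading, one quick check (a sketch: it assumes curl is available on the VMs and that your Mesos version serves /master/state; older releases use /master/state.json) is to query any master's state endpoint and look at its leader field:

curl -s http://192.168.1.36:6060/master/state | grep -o '"leader":"[^"]*"'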

Slaves:

m2 : mesos-slave --ip=192.168.1.42 --hostname=192.168.1.42 --executor_registration_timeout=10mins --systemd_enable_support=false --master=zk://192.168.1.42:2181,192.168.1.45:2181,192.168.1.36:2181/mesos --containerizers=mesos,docker

m3 : mesos-slave --ip=192.168.1.45 --hostname=192.168.1.45 --executor_registration_timeout=10mins --systemd_enable_support=false --master=zk://192.168.1.42:2181,192.168.1.45:2181,192.168.1.36:2181/mesos --containerizers=mesos,docker
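Once both slaves are running, you can verify that they registered with the leading master; a minimal check, assuming the leader is currently the master on 192.168.1.36:

curl -s http://192.168.1.36:6060/master/slaves

Both 192.168.1.42 and 192.168.1.45 should appear in the returned JSON.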

ZooKeeper Config:

tickTime=2000

initLimit=10

syncLimit=5

dataDir=/opt/ncms/zkWorkDir

clientPort=2181

server.1=192.168.1.42:2888:3888

server.3=192.168.1.36:2888:3888

server.5=192.168.1.45:2888:3888
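One detail worth double-checking, since it is not shown above (so this is an assumption about your setup): each ZooKeeper node needs a myid file in its dataDir whose contents match its server.N id, for example:

echo 3 > /opt/ncms/zkWorkDir/myid   # on m1 (server.3 = 192.168.1.36)
echo 1 > /opt/ncms/zkWorkDir/myid   # on m2 (server.1 = 192.168.1.42)
echo 5 > /opt/ncms/zkWorkDir/myid   # on m3 (server.5 = 192.168.1.45)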

Setup:

Host: Windows 7 (64 GB RAM, 24 cores)

VirtualBox: each VM (m1, m2, m3) has 2 cores and 2 GB RAM, running RHEL 7.2


1 Answer


In the scenarios you describe, the number of active masters falls below the quorum, which is 2 in your case. This is considered an exceptional situation, and certain operations will not succeed, for example any operation that modifies the distributed registry.
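For reference, the rule behind that quorum value: --quorum should be set to a strict majority of the masters, and the cluster tolerates at most N - quorum simultaneous master failures. Worked out for this setup:

quorum = floor(N/2) + 1 = floor(3/2) + 1 = 2   (N = 3 masters)
tolerated master failures = N - quorum = 3 - 2 = 1

ZooKeeper applies the same majority rule to its own three-node ensemble, so it likewise tolerates the loss of only one of the three VMs.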