8 votes
  • Test environment: a multi-node Mesos 0.27.2 cluster on AWS (3 x masters, 2 x slaves, quorum=2).
  • Tested persistence with zkCli.sh and it works fine.
  • If I start the masters with --registry=in_memory, everything works fine: a master is elected and I can start tasks via Marathon.
  • If I use the default (--registry=replicated_log), the cluster fails to elect a master:

https://gist.github.com/mitel/67acd44408f4d51af192

EDIT: apparently the problem was the firewall. I applied an allow-all type of rule to all my security groups and now I have a stable master. Once I figure out what was blocking the communication, I'll post it here.
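Something along these lines can be used to check which ports are actually reachable from a given node (a rough sketch, not part of my setup; the addresses are the master/ZooKeeper hosts mentioned above):

import socket

# Rough sketch: probe TCP connectivity to the ports Mesos and ZooKeeper need.
# Run it from each node; adjust the address list to your own topology.
ENDPOINTS = [
    ("10.1.69.172", 5050), ("10.1.9.139", 5050), ("10.1.79.211", 5050),  # mesos masters
    ("10.1.69.172", 2181), ("10.1.9.139", 2181), ("10.1.79.211", 2181),  # zookeeper
]

for host, port in ENDPOINTS:
    try:
        sock = socket.create_connection((host, port), timeout=2)
        sock.close()
        print("{}:{} reachable".format(host, port))
    except OSError as err:
        print("{}:{} blocked or down ({})".format(host, port, err))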


2 Answers

5 votes

Discovered that Mesos masters also initiate connections to the other masters on port 5050. After adding the corresponding egress rule to the masters' security group, the cluster is stable and master election happens as expected.

[screenshot: firewall rules]
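For reference, the missing rule can also be added programmatically. A minimal sketch with boto3 (the security group ID and region below are placeholders, not values from my setup):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

MASTERS_SG = "sg-00000000"   # placeholder: the masters' security group ID

# Allow the masters to open outbound connections on 5050 to other members
# of the same security group, i.e. master -> master traffic.
ec2.authorize_security_group_egress(
    GroupId=MASTERS_SG,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5050,
        "ToPort": 5050,
        "UserIdGroupPairs": [{"GroupId": MASTERS_SG}],
    }],
)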

UPDATE: for those who try to build an internal firewall between the various components of Mesos/ZooKeeper/etc.: don't do it. It's better to design the security the way Mesosphere's DCOS does.

1 vote

First off, let me briefly clarify the flag's meaning for posterity. --registry does not influence leader election; it specifies the persistence strategy for the registry (where Mesos tracks data that should be carried over across failovers). The in_memory value should not be used in production and may even be removed in the future.

Leader election is performed by ZooKeeper. According to your log, you use the following ZooKeeper ensemble: zk://10.1.69.172:2181,10.1.9.139:2181,10.1.79.211:2181/mesos.
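As a side note, you can inspect the election state directly in ZooKeeper: the masters register ephemeral sequential znodes under /mesos, and the contender with the lowest sequence number is the elected leader. A small sketch using the kazoo client (the ensemble string is the one from your log; everything else is illustrative):

from kazoo.client import KazooClient

# Connect to the ensemble from the question and list what the masters
# registered under /mesos.
zk = KazooClient(hosts="10.1.69.172:2181,10.1.9.139:2181,10.1.79.211:2181")
zk.start()

for name in sorted(zk.get_children("/mesos")):
    data, stat = zk.get("/mesos/" + name)
    # ephemeralOwner != 0 means the znode is tied to a live master's session
    kind = "ephemeral" if stat.ephemeralOwner else "persistent"
    print(name, kind, len(data), "bytes")

zk.stop()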

Now, from your log, the cluster did not fail to elect a master; it actually did so twice:


I0313 18:35:28.257139  3253 master.cpp:1710] The newly elected leader is master@10.1.69.172:5050 with id edd3e4a7-ede8-44fe-b24c-67a8790e2b79
...
I0313 18:35:36.074087  3257 master.cpp:1710] The newly elected leader is master@10.1.9.139:5050 with id c4fd7c4d-e3ce-4ac3-9d8a-28c841dca7f5

I can't say exactly why the leader was elected twice; for that I would need the logs from the two other masters as well. According to your log, the last elected leader is on 10.1.9.139:5050, which is most probably not the machine you provided the log from.

One suspicious thing I see in the log is that master IDs differ for the same IP:port. Do you have an idea why?

I0313 18:35:28.237251  3244 master.cpp:374] Master 24ecdfff-2c97-4de8-8b9c-dcea91115809 (10.1.69.172) started on 10.1.69.172:5050
...
I0313 18:35:28.257139  3253 master.cpp:1710] The newly elected leader is master@10.1.69.172:5050 with id edd3e4a7-ede8-44fe-b24c-67a8790e2b79