2 votes

I have the following setup - 4 CentOS 7.0 VMs named master, box01, box02, box03.

master : mesos-master, mesos-slave

box01 : mesos-master, mesos-slave, zkServer

box02 : mesos-master, mesos-slave, zkServer

box03 : mesos-slave, zkServer

Whenever I run a Mesos framework on the cluster WITHOUT ZooKeeper started, everything runs fine. However, when I deploy and start the ZooKeeper cluster, a framework will ONLY finish if it was run from the SAME machine as the ACTIVE Mesos master.

E.g., suppose the elected master is box01. If I run a framework from box01, it completes fine. If I run it from the master box, I get the following log on the client side and it never continues:

I1101 13:56:11.997733  5384 sched.cpp:164] Version: 0.24.0
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@716: Client environment:host.name=master.localdomain
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/user/download
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=box01:2181,box02:2181,box03:2181 sessionTimeout=10000 watcher=0x7f560236e6d4 sessionId=0 sessionPasswd=<null> context=0x7f5604003c50 flags=0
2015-11-01 13:56:12,018:5383(0x7f55fd613700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.0.11:2181]
2015-11-01 13:56:12,025:5383(0x7f55fd613700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.0.11:2181], sessionId=0x150c2c9ffc6002d, negotiated timeout=10000
I1101 13:56:12.027992  5398 group.cpp:331] Group process (group(1)@10.0.0.10:35217) connected to ZooKeeper
I1101 13:56:12.028153  5398 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1101 13:56:12.028198  5398 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1101 13:56:12.036267  5398 detector.cpp:156] Detected a new leader: (id='11')
I1101 13:56:12.037309  5398 group.cpp:674] Trying to get '/mesos/json.info_0000000011' in ZooKeeper
I1101 13:56:12.041631  5398 detector.cpp:481] A new leading master ([email protected]:5050) is detected
I1101 13:56:12.042068  5398 sched.cpp:262] New master detected at [email protected]:5050
I1101 13:56:12.043937  5398 sched.cpp:272] No credentials provided. Attempting to register without authentication

We can see that the client successfully detects that 10.0.0.11 (box01) is the acting master. If at this point I kill the acting Mesos master (box01), a new election occurs, and since a quorum of 2 is still available (the master and box03 boxes), a new master is elected. If the new master is the master box, the framework successfully completes its task. If it is box03, the client detects it as the new master and hangs again. There should be an easy explanation for this, but I can't seem to get out of my thinking box at this point. Please help out.

I am using Mesos 0.24.0 and ZooKeeper 3.4.6.

zookeeper-3.4.6/conf/zoo.cfg

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=box01:2888:3888
server.2=box02:2888:3888
server.3=box03:2888:3888
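One detail not shown in the config above: a clustered ZooKeeper also needs a myid file in dataDir on each node, containing the number from that node's server.N line. A minimal sketch for box01 (use 2 on box02 and 3 on box03), assuming the dataDir from the zoo.cfg above:

```shell
# Create the myid file that tells this ZooKeeper node which server.N
# entry in zoo.cfg it is. On box01 the id is 1 (server.1=box01:2888:3888).
DATADIR=/var/lib/zookeeper
mkdir -p "$DATADIR"
echo 1 > "$DATADIR/myid"
```

Without a matching myid on every node, the ensemble fails to form a quorum at all, so this is only a sanity check here.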

/etc/hosts file

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.0.10   master master.localdomain
10.0.0.11   box01 box01.localdomain
10.0.0.12   box02 box02.localdomain
10.0.0.13   box03 box03.localdomain

On each machine the firewall settings are:

firewall-cmd --list-ports
5051/tcp 3888/tcp 2181/tcp 2888/tcp 5050/tcp
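Note that these rules cover only the fixed ZooKeeper and Mesos ports; the scheduler driver itself listens on an ephemeral port chosen by libprocess, which these rules do not cover. One possible workaround (an assumption, not part of the original setup) is to pin that port via the LIBPROCESS_PORT environment variable and open it explicitly:

```shell
# Hypothetical workaround: pin the scheduler driver to a fixed port
# (libprocess otherwise binds an ephemeral port the firewall won't allow).
# 5052 is an arbitrary choice.
export LIBPROCESS_PORT=5052

# Then open that port (run as root):
# firewall-cmd --add-port=${LIBPROCESS_PORT}/tcp
# firewall-cmd --permanent --add-port=${LIBPROCESS_PORT}/tcp
```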

To start mesos-master I use:

/home/user/download/mesos-0.24.0/build/bin/mesos-master.sh --ip=10.0.0.10 --work_dir=/home/user/download/data-mesos --zk=zk://box01:2181,box02:2181,box03:2181/mesos --quorum=2

To start mesos-slave I use:

/home/user/download/mesos-0.24.0/build/bin/mesos-slave.sh --master=zk://box01:2181,box02:2181,box03:2181/mesos

EDIT:

It turns out that if I run a stand-alone Mesos master on box02 (10.0.0.12) and try to run the framework from the master (10.0.0.10) box, the framework's run request is received by the Mesos master, but it is not executed.

box02 master log: (an excerpt is quoted in the answer below)

master box framework log:

[root@master ~]# java -Djava.library.path=/usr/local/lib -jar /home/user/download/test-framework/example-framework-1.0-SNAPSHOT-jar-with-dependencies.jar box02:5050
I1103 13:44:21.898962 20958 sched.cpp:164] Version: 0.24.0
I1103 13:44:21.910660 20972 sched.cpp:262] New master detected at [email protected]:5050
I1103 13:44:21.913422 20972 sched.cpp:272] No credentials provided. Attempting to register without authentication

Therefore, it seems that ZooKeeper has nothing to do with the problem; rather, for some reason the master cannot send anything back to the machine executing the framework (the Mesos scheduler).

Having master logs (from both the failed-over and the taking-over master) and the framework log would help in triaging the issue. – rukletsov
The framework log is the first piece of code in my original post. I will provide logs from the two masters later today. – LIvanov
I edited the original question. – LIvanov
I still do not see the log of the active master. Could you please attach it? – rukletsov
box02 is the active master. This is the log. The framework is executed by another box called "master". Sorry for the messed-up setup. – LIvanov

1 Answer

0 votes

From the master logs you provided, my guess is that the master cannot open a connection back to your framework. This portion of the master log looks suspicious:

I1103 13:44:21.513394 11288 master.cpp:2094] Received SUBSCRIBE call for framework 'framework-example' at [email protected]:36455
I1103 13:44:21.513703 11288 master.cpp:2164] Subscribing framework framework-example with checkpointing disabled and capabilities [  ]
I1103 13:44:21.516088 11288 hierarchical.hpp:391] Added framework 20151103-134410-201326602-5050-11260-0000
I1103 13:44:21.517375 11288 master.cpp:4613] Sending 1 offers to framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at [email protected]:36455
E1103 13:44:21.519042 11291 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not connected [107]
I1103 13:44:21.520539 11288 master.cpp:1051] Framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at [email protected]:36455 disconnected
I1103 13:44:21.520593 11288 master.cpp:2370] Disconnecting framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at [email protected]:36455
I1103 13:44:21.520608 11288 master.cpp:2394] Deactivating framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at [email protected]:36455
W1103 13:44:21.520922 11288 master.hpp:1409] Master attempted to send message to disconnected framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at [email protected]:36455

Could you please check that the LIBPROCESS_IP variable is set correctly on the framework node, and that the master can open a connection to the framework node?
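For reference, a minimal sketch of what that could look like on the framework node (the IP below is the "master" box from the question; setting it is a suggested fix, not something the poster has confirmed):

```shell
# Tell libprocess which IP to advertise, so the elected Mesos master can
# open a connection back to the scheduler driver on this box (10.0.0.10).
export LIBPROCESS_IP=10.0.0.10

# Then launch the framework exactly as before (command from the question):
# java -Djava.library.path=/usr/local/lib \
#   -jar /home/user/download/test-framework/example-framework-1.0-SNAPSHOT-jar-with-dependencies.jar \
#   box02:5050
```

Without LIBPROCESS_IP, libprocess may advertise an address the master cannot reach, which would match the "Transport endpoint is not connected" error in the log above.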