I have the following setup: 4 CentOS 7.0 VMs named master, box01, box02, box03.
master : mesos-master, mesos-slave
box01 : mesos-master, mesos-slave, zkServer
box02 : mesos-master, mesos-slave, zkServer
box03 : mesos-slave, zkServer
Whenever I run a Mesos framework on the cluster without ZooKeeper started, everything runs fine. However, once I deploy and start the ZooKeeper cluster, a framework only finishes if it was launched from the same machine that is the active Mesos master.
For example, say the elected master is box01. If I run a framework from box01, it completes fine. If I run it from the master box, I get the following log on the client side and it never proceeds:
I1101 13:56:11.997733 5384 sched.cpp:164] Version: 0.24.0
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@716: Client environment:host.name=master.localdomain
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/user/download
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=box01:2181,box02:2181,box03:2181 sessionTimeout=10000 watcher=0x7f560236e6d4 sessionId=0 sessionPasswd=<null> context=0x7f5604003c50 flags=0
2015-11-01 13:56:12,018:5383(0x7f55fd613700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.0.11:2181]
2015-11-01 13:56:12,025:5383(0x7f55fd613700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.0.11:2181], sessionId=0x150c2c9ffc6002d, negotiated timeout=10000
I1101 13:56:12.027992 5398 group.cpp:331] Group process (group(1)@10.0.0.10:35217) connected to ZooKeeper
I1101 13:56:12.028153 5398 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1101 13:56:12.028198 5398 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1101 13:56:12.036267 5398 detector.cpp:156] Detected a new leader: (id='11')
I1101 13:56:12.037309 5398 group.cpp:674] Trying to get '/mesos/json.info_0000000011' in ZooKeeper
I1101 13:56:12.041631 5398 detector.cpp:481] A new leading master (master@10.0.0.11:5050) is detected
I1101 13:56:12.042068 5398 sched.cpp:262] New master detected at master@10.0.0.11:5050
I1101 13:56:12.043937 5398 sched.cpp:272] No credentials provided. Attempting to register without authentication
We can see that the client successfully discovers that 10.0.0.11 (box01) is the acting master. If at this point I kill the acting Mesos master (box01), a new election occurs, and since a quorum of 2 is still present (the master and box03 boxes), a new master is elected. If the new master is the master box, the framework completes its task successfully. If it is box03, the client detects it as the master and hangs again. There should be an easy explanation for this, but I can't seem to get out of my thinking box at this point. Please help out.
I am using Mesos 0.24.0 and ZooKeeper 3.4.6.
zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=box01:2888:3888
server.2=box02:2888:3888
server.3=box03:2888:3888
/etc/hosts file
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.0.10 master master.localdomain
10.0.0.11 box01 box01.localdomain
10.0.0.12 box02 box02.localdomain
10.0.0.13 box03 box03.localdomain
On each machine the firewall settings are:
firewall-cmd --list-ports
5051/tcp 3888/tcp 2181/tcp 2888/tcp 5050/tcp
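One thing worth checking against this list is the port the scheduler itself listens on. In the client log above, the group process reports itself as group(1)@10.0.0.10:35217. The sketch below is only an illustration, with hypothetical helper names (open_tcp_ports, libprocess_port): it parses the port list and that log line and reports whether the scheduler's port is among the opened ones.

```python
import re

# Ports reported by `firewall-cmd --list-ports` on each machine (from the post).
OPEN_PORTS = "5051/tcp 3888/tcp 2181/tcp 2888/tcp 5050/tcp"

# Client log line from the post that shows the scheduler's own endpoint.
LOG_LINE = (
    "I1101 13:56:12.027992 5398 group.cpp:331] Group process "
    "(group(1)@10.0.0.10:35217) connected to ZooKeeper"
)


def open_tcp_ports(listing):
    """Return the set of TCP port numbers in a firewall-cmd style listing."""
    return {int(p.split("/")[0]) for p in listing.split() if p.endswith("/tcp")}


def libprocess_port(log_line):
    """Extract the port from a 'name@ip:port' endpoint in a log line."""
    match = re.search(r"@(\d+\.\d+\.\d+\.\d+):(\d+)", log_line)
    return int(match.group(2)) if match else None


port = libprocess_port(LOG_LINE)
print(port, port in open_tcp_ports(OPEN_PORTS))  # prints: 35217 False
```

If the scheduler's port is not in the opened set, traffic from the master back to the scheduler would be a natural suspect.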
To start mesos-master I use:
/home/user/download/mesos-0.24.0/build/bin/mesos-master.sh --ip=10.0.0.10 --work_dir=/home/user/download/data-mesos --zk=zk://box01:2181,box02:2181,box03:2181/mesos --quorum=2
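As a sanity check on the --quorum value: the quorum should be a strict majority of the mesos-master instances. A minimal sketch, assuming the three masters listed at the top of the post (majority_quorum is a hypothetical helper name):

```python
def majority_quorum(num_masters):
    """Smallest number of masters that forms a strict majority."""
    return num_masters // 2 + 1


# Machines running mesos-master according to the setup described in the post.
masters = ["master", "box01", "box02"]
print(majority_quorum(len(masters)))  # prints: 2
```

With three masters this gives 2, which matches the --quorum=2 flag used above.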
To start mesos-slave I use:
/home/user/download/mesos-0.24.0/build/bin/mesos-slave.sh --master=zk://box01:2181,box02:2181,box03:2181/mesos
EDIT:
It turns out that if I run a stand-alone Mesos master on box02 (10.0.0.12) and try to run the framework from the master box (10.0.0.10), the framework's registration request is received by the Mesos master, but the job is never executed.
master box framework log
[root@master ~]# java -Djava.library.path=/usr/local/lib -jar /home/user/download/test-framework/example-framework-1.0-SNAPSHOT-jar-with-dependencies.jar box02:5050
I1103 13:44:21.898962 20958 sched.cpp:164] Version: 0.24.0
I1103 13:44:21.910660 20972 sched.cpp:262] New master detected at master@10.0.0.12:5050
I1103 13:44:21.913422 20972 sched.cpp:272] No credentials provided. Attempting to register without authentication
Therefore, it seems that ZooKeeper has nothing to do with the problem; rather, for some reason the master cannot send anything back to the machine running the framework (the Mesos scheduler).
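To test that suspicion directly, a plain TCP probe run on the master host against the scheduler's address would show whether the return path is open. This is a minimal sketch, not part of Mesos: can_connect is a hypothetical helper, and the host and port in the usage comment are taken from the first client log (10.0.0.10:35217). The scheduler picks a different port on each run, so the value has to be read from the current log.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage on the machine running the active mesos-master,
# while the framework is still hanging on the master box:
#   can_connect("10.0.0.10", 35217)
```

A False result while the framework is still running would point at the firewall or routing between the two machines rather than at ZooKeeper. If that is the case, one option to try (an assumption on my part, not verified on this cluster) is pinning the scheduler's port through the LIBPROCESS_PORT environment variable and opening that port in the firewall.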