When using mesos, marathon, and zookeeper my mesos-slave doesnt start when I specify the "containerizers" file with "docker,mesos"?

Question

I have 3 CentOS VMs and I have installed Zookeeper, Marathon, and Mesos on the master node, while only putting Mesos on the other 2 VMs. The master node has no mesos-slave running on it. I am trying to run Docker containers so i specified "docker,mesos" in the containerizes file. One of the mesos-agents starts fine with this configuration and I have been able to deploy a container to that slave. However, the second mesos-agent simply fails when I have this configuration (it works if i take out that containerizes file but then it doesn't run containers). Here are some of the logs and information that has come up:

Here are some "messages" in the log directory:

Apr 26 16:09:12 centos-minion-3 systemd: Started Mesos Slave.
Apr 26 16:09:12 centos-minion-3 systemd: Starting Mesos Slave...
WARNING: Logging before InitGoogleLogging() is written to STDERR
[main.cpp:243] Build: 2017-04-12 16:39:09 by centos
[main.cpp:244] Version: 1.2.0
[main.cpp:247] Git tag: 1.2.0
[main.cpp:251] Git SHA: de306b5786de3c221bae1457c6f2ccaeb38eef9f
[logging.cpp:194] INFO level logging started!
[systemd.cpp:238] systemd version `219` detected
[main.cpp:342] Inializing systemd state
[systemd.cpp:326] Started systemd slice `mesos_executors.slice`
[containerizer.cpp:220] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
[linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
[provisioner.cpp:249] Using default backend 'copy'
[slave.cpp:211] Mesos agent started on (1)@172.22.150.87:5051
[slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="linux" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
[slave.cpp:541] Agent resources: cpus(*):1; mem(*):919; disk(*):2043; ports(*):[31000-32000]
[slave.cpp:549] Agent attributes: [  ]
[slave.cpp:554] Agent hostname: node3
[status_update_manager.cpp:177] Pausing sending status updates
[state.cpp:62] Recovering state from '/var/lib/mesos/meta'
[state.cpp:706] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
[status_update_manager.cpp:203] Recovering status update manager
[docker.cpp:868] Recovering Docker containers
[containerizer.cpp:599] Recovering containerizer
[provisioner.cpp:410] Provisioner recovery complete
[group.cpp:340] Group process (zookeeper-group(1)@172.22.150.87:5051) connected to ZooKeeper
[group.cpp:830] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
[group.cpp:418] Trying to create path '/mesos' in ZooKeeper
[detector.cpp:152] Detected a new leader: (id='15')
[group.cpp:699] Trying to get '/mesos/json.info_0000000015' in ZooKeeper
[zookeeper.cpp:259] A new leading master ([email protected]:5050) is detected
Failed to perform recovery: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
       This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Apr 26 16:09:13 centos-minion-3 systemd: Unit mesos-slave.service entered failed state.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service failed.

Logs from docker:

$ sudo systemctl status docker
● docker.service - Docker Application Container Engine Loaded: 
  loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled) 
  Drop-In: /usr/lib/systemd/system/docker.service.d 
  └─flannel.conf Active: inactive (dead) since Tue 2017-04-25 18:00:03 CDT; 
    24h ago Docs: docs.docker.com Main PID: 872 (code=exited, status=0/SUCCESS) 
    Apr 26 18:25:25 centos-minion-3 systemd[1]: Dependency failed for Docker Application Container Engine. 
    Apr 26 18:25:25 centos-minion-3 systemd[1]: Job docker.service/start failed with result 'dependency'

Logs from flannel:

[flanneld-start: network.go:102] failed to retrieve network config: client: etcd cluster is unavailable or misconfigured

Can you attach more logs? Especially including reason why agent is failing? Do not attach them as images, use code formatted text instead — janisz
Hi janisz- I added all the logs I could find that are relevant to this problem. If there are any more files that would be helpful please let me know. Would love to hear any tips or tricks you might have. Thanks so much for the help! — kmahesh3
Can you post error log too? I think something is missing here. — janisz
Sorry, not sure which file you are referencing. I am currently looking at the /var/log/mesos/ folder in the failed node (I posted a list of the files in this dir in the original post). — kmahesh3

janisz janisz · Accepted Answer · 2017-04-26T21:57:34

You have answer in your logs

Failed to perform recovery: Collect failed: 
Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; 
stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
       This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.

Mesos keeps it state/metadata on local disk. When it's restarted it try to load this state. If configuration changed and is not compatible with previous state it won't start.

Just bring docker to live by fixing problems with flannel and etcd and everything will be fine.

When using mesos, marathon, and zookeeper my mesos-slave doesnt start when I specify the "containerizers" file with "docker,mesos"?

2 Answers