
I have a three-node setup running Marathon, mesos-master, mesos-slave, and ZooKeeper with HA enabled. I tested a deployment of a simple hello app using mesos-execute, and it works as expected.

Since everything looked fine, I connected to Marathon and deployed a simple app to test it (echo "hello" >> /tmp/output.txt), but the application gets stuck in the "Waiting" status.

What could be preventing Marathon from using Mesos resources for the deployment?

Logs from mesos-master:

I0904 11:23:27.064332 19769 master.cpp:2813] Received SUBSCRIBE call for framework 'marathon' at [email protected]:36324
I0904 11:23:27.064623 19769 master.cpp:2890] Subscribing framework marathon with checkpointing enabled and capabilities [ PARTITION_AWARE ]
I0904 11:23:27.064669 19769 master.cpp:6272] Updating info for framework cb16118a-2257-4020-a907-63aa6294e11b-0000
I0904 11:23:27.064697 19769 master.cpp:2994] Framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at [email protected]:36324 failed over
I0904 11:23:27.065032 19770 hierarchical.cpp:342] Activated framework cb16118a-2257-4020-a907-63aa6294e11b-0000
I0904 11:23:27.065465 19770 master.cpp:7305] Sending 3 offers to framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at [email protected]:36324
I0904 11:23:27.907865 19769 http.cpp:1115] HTTP GET for /files/read?_=1504517007920&jsonp=jQuery17109098185077823333_1504516979864&length=50000&offset=352538&path=%2Fmaster%2Flog from 192.168.40.1:53525 with User-Agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
I0904 11:23:28.916651 19768 http.cpp:1115] HTTP GET for /files/read?_=1504517008930&jsonp=jQuery17109098185077823333_1504516979865&length=50000&offset=353797&path=%2Fmaster%2Flog from 192.168.40.1:53525 with User-Agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
E0904 11:23:30.071293 19775 process.cpp:2450] Failed to shutdown socket with fd 39, address 192.168.40.159:58072: Transport endpoint is not connected
I0904 11:23:30.073277 19768 master.cpp:1430] Framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at [email protected]:36324 disconnected
I0904 11:23:30.073307 19768 master.cpp:3160] Deactivating framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at [email protected]:36324
I0904 11:23:30.073485 19768 master.cpp:3137] Disconnecting framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at [email protected]:36324
I0904 11:23:30.073496 19768 master.cpp:1445] Giving framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at [email protected]:36324 1weeks to failover
I0904 11:23:30.073519 19768 hierarchical.cpp:374] Deactivated framework cb16118a-2257-4020-a907-63aa6294e11b-0000

curl -XGET 'http://mesosphere2:8098/v2/queue?pretty' | jq

{
  "queue": [
    {
      "count": 1,
      "delay": {
        "timeLeftSeconds": 0,
        "overdue": true
      },
      "since": "2017-09-04T13:12:42.024Z",
      "processedOffersSummary": {
        "processedOffersCount": 12,
        "unusedOffersCount": 12,
        "lastUnusedOfferAt": "2017-09-04T13:14:52.554Z",
        "rejectSummaryLastOffers": [
          {
            "reason": "UnfulfilledRole",
            "declined": 3,
            "processed": 3
          },
          {
            "reason": "UnfulfilledConstraint",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "NoCorrespondingReservationFound",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientCpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientMemory",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientDisk",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientGpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientPorts",
            "declined": 0,
            "processed": 0
          }
        ],
        "rejectSummaryLaunchAttempt": [
          {
            "reason": "UnfulfilledRole",
            "declined": 12,
            "processed": 12
          },
          {
            "reason": "UnfulfilledConstraint",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "NoCorrespondingReservationFound",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientCpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientMemory",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientDisk",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientGpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientPorts",
            "declined": 0,
            "processed": 0
          }
        ]
      },
      "app": {
        "id": "/test03",
        "acceptedResourceRoles": [
          "slave_public"
        ],
        "backoffFactor": 1.15,
        "backoffSeconds": 1,
        "container": {
          "type": "DOCKER",
          "docker": {
            "forcePullImage": false,
            "image": "laghao/hello-marathon",
            "network": "BRIDGE",
            "parameters": [],
            "portMappings": [
              {
                "containerPort": 80,
                "hostPort": 80,
                "labels": {},
                "protocol": "tcp",
                "servicePort": 10003
              }
            ],
            "privileged": false
          },
          "volumes": []
        },
        "cpus": 0.1,
        "disk": 0,
        "executor": "",
        "instances": 1,
        "labels": {},
        "maxLaunchDelaySeconds": 3600,
        "mem": 64,
        "gpus": 0,
        "portDefinitions": [
          {
            "port": 10003,
            "name": "default",
            "protocol": "tcp"
          }
        ],
        "requirePorts": false,
        "upgradeStrategy": {
          "maximumOverCapacity": 1,
          "minimumHealthCapacity": 1
        },
        "version": "2017-09-04T13:12:41.993Z",
        "versionInfo": {
          "lastScalingAt": "2017-09-04T13:12:41.993Z",
          "lastConfigChangeAt": "2017-09-04T13:12:41.993Z"
        },
        "killSelection": "YOUNGEST_FIRST",
        "unreachableStrategy": {
          "inactiveAfterSeconds": 300,
          "expungeAfterSeconds": 600
        }
      }
    }
  ]
}
Can you show Marathon's logs? "Waiting" means there are no resources available that meet the application's constraints. In Marathon 1.4+ you can debug which resources are missing for a given deployment with the /v2/queue endpoint. - janisz

1 Answer


From the documentation:

An app stays in "Waiting" forever

This means that Marathon does not receive "Resource Offers" from Mesos that allow it to start tasks of this application. The simplest failure is that there are not sufficient resources available in the cluster or another framework hoards all these resources. You can check the Mesos UI for available resources. Note that the required resources (such as CPU, Mem, Disk) have to be all available on a single host.

If you do not find the solution yourself and you create a GitHub issue, please append the output of Mesos /state endpoint to the bug report so that we can inspect available cluster resources.
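To illustrate the "single host" point from the quoted documentation, here is a minimal sketch. The `agents` list and its `hostname`/`free` fields are a hypothetical data shape made up for this example, not the actual layout of the Mesos /state output; you would have to build such a list from whatever your cluster reports.

```python
def agents_that_fit(agents, cpus, mem, disk=0):
    """Return hostnames of agents that can satisfy ALL requirements on their own.

    `agents` is a hypothetical list of dicts such as
    {"hostname": "node1", "free": {"cpus": 0.5, "mem": 512, "disk": 1000}}.
    """
    fitting = []
    for agent in agents:
        free = agent["free"]
        if free["cpus"] >= cpus and free["mem"] >= mem and free["disk"] >= disk:
            fitting.append(agent["hostname"])
    return fitting


# Three made-up agents: together they have 1.5 CPUs free, but none has 1.0 alone.
example_agents = [
    {"hostname": "node1", "free": {"cpus": 0.5, "mem": 512, "disk": 1000}},
    {"hostname": "node2", "free": {"cpus": 0.5, "mem": 512, "disk": 1000}},
    {"hostname": "node3", "free": {"cpus": 0.5, "mem": 512, "disk": 1000}},
]

print(agents_that_fit(example_agents, cpus=1.0, mem=64))  # no single host fits
print(agents_that_fit(example_agents, cpus=0.1, mem=64))  # every host fits
```

With the made-up numbers above, an app asking for 1.0 CPU would wait even though the cluster as a whole has enough, while the 0.1 CPU / 64 MB app from the question would fit on any of these hosts as far as CPU and memory are concerned.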

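In your case, the role the application requires does not match the role of the resources the agents offer. You can deduce this from UnfulfilledRole in the queue output: the app sets acceptedResourceRoles to ["slave_public"], and all 12 processed offers were declined for that reason.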
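One possible way to resolve the role mismatch is to change the app's acceptedResourceRoles. The sketch below assumes your agents offer unreserved resources (role `*`), which I cannot confirm from the question; if you actually want the app on `slave_public`, the agents need to offer resources under that role instead. The host, port and app id are taken from the question, and the helper names are made up for this example. It only builds the PUT request for Marathon's /v2/apps endpoint and does not send it.

```python
import json
from urllib.request import Request


def roles_update(accepted_roles):
    """Partial app definition that only changes acceptedResourceRoles."""
    return {"acceptedResourceRoles": list(accepted_roles)}


def build_update_request(base_url, app_id, accepted_roles):
    """Build (but do not send) a PUT request against /v2/apps/<app id>."""
    url = base_url.rstrip("/") + "/v2/apps/" + app_id.strip("/")
    body = json.dumps(roles_update(accepted_roles)).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumption: the agents offer unreserved ("*") resources.
req = build_update_request("http://mesosphere2:8098", "/test03", ["*"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually apply the change, pass the request to urllib.request.urlopen(req).
```

You can make the same change in the Marathon UI by editing the app's JSON; the request above is just the scripted equivalent.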

Marathon 1.4 introduced information about stuck deployments: you can query /v2/queue to get statistics on why offers were declined.
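The /v2/queue output is long, so it can help to filter it down to the reasons that actually declined offers. Here is a rough sketch that does this; the field names come from the response posted in the question, while the function names and the `base_url` parameter are my own. The sample data is a trimmed copy of that response.

```python
import json
from urllib.request import urlopen


def declined_reasons(queue_response):
    """Map each queued app id to the reasons that declined at least one offer."""
    result = {}
    for item in queue_response.get("queue", []):
        summary = item.get("processedOffersSummary", {})
        reasons = {
            entry["reason"]: entry["declined"]
            for entry in summary.get("rejectSummaryLaunchAttempt", [])
            if entry["declined"] > 0
        }
        result[item["app"]["id"]] = reasons
    return result


def fetch_queue(base_url):
    """Fetch /v2/queue from Marathon, e.g. base_url='http://mesosphere2:8098'."""
    with urlopen(base_url.rstrip("/") + "/v2/queue") as resp:
        return json.load(resp)


# Trimmed copy of the response from the question.
sample = {
    "queue": [
        {
            "processedOffersSummary": {
                "processedOffersCount": 12,
                "unusedOffersCount": 12,
                "rejectSummaryLaunchAttempt": [
                    {"reason": "UnfulfilledRole", "declined": 12, "processed": 12},
                    {"reason": "UnfulfilledConstraint", "declined": 0, "processed": 0},
                    {"reason": "InsufficientCpus", "declined": 0, "processed": 0},
                    {"reason": "InsufficientMemory", "declined": 0, "processed": 0},
                ],
            },
            "app": {"id": "/test03", "acceptedResourceRoles": ["slave_public"]},
        }
    ]
}

print(declined_reasons(sample))  # {'/test03': {'UnfulfilledRole': 12}}
```

Against a live cluster you would call `declined_reasons(fetch_queue(...))` instead of using the sample. For the app in the question, the only reason left after filtering is UnfulfilledRole.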