3 votes

I'm running a Mesos cluster with three masters and three slaves, currently colocated on the same machines.

My question is that sometimes a process gets abruptly stopped in both Marathon and Chronos. After checking the logs I saw that, every time, the mesos-slave asked the framework to kill those tasks. I've tried to Google it and to find it here, but I haven't found a relevant answer.

How can I log, or otherwise find out, why the mesos-slave asks one of the registered frameworks to kill a task?

The relevant log lines follow:

Jan 25 02:48:58 hostname mesos-slave[9817]: I0125 02:48:58.143537  9843 slave.cpp:1372] Asked to kill task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:48:59 hostname mesos-slave[9817]: I0125 02:48:59.108821  9834 slave.cpp:2215] Handling status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000 from executor(1)@192.168.49.1:42710
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:04.976814  9823 status_update_manager.cpp:317] Received status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.108389  9823 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.280825  9848 slave.cpp:2458] Forwarding the update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000 to [email protected]:5050
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.346415  9848 slave.cpp:2391] Sending acknowledgement for status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000 to executor(1)@192.168.49.1:42710
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.443266  9820 status_update_manager.cpp:389] Received status update acknowledgement (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.443447  9820 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.419437  9833 slave.cpp:2898] Executor 'TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' of framework 20150311-145345-20031680-5050-2698-0000 exited with status 0
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.445489  9833 slave.cpp:3007] Cleaning up executor 'TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471329  9837 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2/runs/473a2313-0147-44ae-ab9c-b39f5a23be22' for gc 6.99999454929185days in the future
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471817  9837 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' for gc 6.99999454685037days in the future
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471911  9837 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2/runs/473a2313-0147-44ae-ab9c-b39f5a23be22' for gc 6.99999454636444days in the future
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471997  9837 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' for gc 6.99999454594963days in the future

One answer I found to someone else's question with the same error suggested checking whether the task gets killed by the OOM killer. I checked, and there is no out-of-memory problem and no relevant kernel log. The mesos-slave itself logs that it asks the framework to kill the task, so I don't think it's an outside process; correct me if I'm wrong.

I currently use:
Mesos: 0.21.1-1.2.debian77
Marathon: 0.8.0-1.1.97.debian77
Chronos: 2.3.2-0.1.20150207000917.debian77

I do know they are outdated, but this problem has been occurring for a long time, seemingly at random times and affecting random containers. Even if it occurs less in future releases, I'm still bothered that a slave decides to kill a task without logging any reason...

If you need any more logs, just ask which ones to provide. I only included this little because the container had been running for more than a day without any problem or error/warning in the Mesos logs or stderr, and then the first line above suddenly appeared, asking the slave to kill it.

Are the tasks just killed, or migrated to other slaves? If so, this could be a matter of missing resources on the slave where the application originally ran... – Tobi
After the task gets killed, Marathon starts it again, but Marathon doesn't follow the normal scale/upgrade procedure since it ignores the minimumHealthCapacity property; it simply kills the task and starts a new one (sometimes on the same host) without waiting for it to be healthy. If a Chronos task that was scheduled as a one-time run gets killed, it won't be restarted by Chronos. These two behaviour patterns are really concerning... – Salaander
May I ask why you don't upgrade to more recent versions of the Mesos stack? – Tobi
It's in production and an upgrade is already scheduled, but I'm not so much bothered that this error happens as by the fact that I have no way of getting any information about it from the logs or anywhere else, and whenever I see other questions mentioning this I find no answer. After the upgrade I will check whether the logs get any better/more verbose; I just haven't tried the trace log level yet, but I think this should be logged at least at the Info level... Thanks for the tips anyway. – Salaander
I see... Sorry that I couldn't really help. – Tobi

2 Answers

0 votes

Add a health check command to your Marathon app. Newer versions of Marathon kill an unhealthy app after a grace period (300 seconds by default), and the task keeps getting respawned on some other slave.

The easiest way to add a health check is to use the pwd command; see the sketch below.
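
For example, a minimal COMMAND health check in the Marathon app definition could look something like this (the interval, timeout and failure values here are just illustrative placeholders to tune for your app):

"healthChecks": [
  {
    "protocol": "COMMAND",
    "command": { "value": "pwd" },
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 3
  }
]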

If the health check doesn't help, try increasing CPU and RAM.


0 votes

We recently encountered this problem as well. What we found out is that the task was actually killed by the OOM killer.

It is mentioned in Docker's reference documentation that:

By default, kernel kills processes in a container if an out-of-memory (OOM) error occurs. To change this behaviour, use the --oom-kill-disable option.

What we realized is that the OOM error doesn't need to happen on the mesos-slave host (i.e. the Docker host): if the Docker container runs out of memory while the Docker host still has free memory, the process gets killed as well.

After we added the --oom-kill-disable option to our Marathon task as follows, the problem went away.

"parameters": [
  {
    "key": "oom-kill-disable",
    "value": "true"
  }
]
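
For reference, this is roughly where that block sits in the Marathon app definition when using the Docker containerizer (the image name below is only a placeholder):

"container": {
  "type": "DOCKER",
  "docker": {
    "image": "your/image:latest",
    "parameters": [
      {
        "key": "oom-kill-disable",
        "value": "true"
      }
    ]
  }
}

Also note that Docker's documentation warns against disabling the OOM killer without also setting a memory limit on the container, since otherwise the host itself can run out of memory.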