I'm running a mesos cluster with with three masters and slaves currently on the same machines.
My question is that sometime I see that a process gets abruptly stopped both in Marathon and Chronos. After checking the logs I saw, that every time, mesos-slave asked the frameworks to kill those tasks. I've tried to google it, find it here but I haven't found a relevant answer.
How can I log or get to know, why the mesos-slave asks one of the registered framework to kill a task?
Log with relevant lines following:
Jan 25 02:48:58 hostname mesos-slave[9817]: I0125 02:48:58.143537 9843 slave.cpp:1372] Asked to kill task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:48:59 hostname mesos-slave[9817]: I0125 02:48:59.108821 9834 slave.cpp:2215] Handling status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000 from executor(1)@192.168.49.1:42710
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:04.976814 9823 status_update_manager.cpp:317] Received status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.108389 9823 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.280825 9848 slave.cpp:2458] Forwarding the update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000 to [email protected]:5050
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.346415 9848 slave.cpp:2391] Sending acknowledgement for status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000 to executor(1)@192.168.49.1:42710
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.443266 9820 status_update_manager.cpp:389] Received status update acknowledgement (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:05 hostname mesos-slave[9817]: I0125 02:49:05.443447 9820 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_KILLED (UUID: abad489c-73bb-4f45-abbe-85f033ddde51) for task TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2 of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.419437 9833 slave.cpp:2898] Executor 'TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' of framework 20150311-145345-20031680-5050-2698-0000 exited with status 0
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.445489 9833 slave.cpp:3007] Cleaning up executor 'TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' of framework 20150311-145345-20031680-5050-2698-0000
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471329 9837 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2/runs/473a2313-0147-44ae-ab9c-b39f5a23be22' for gc 6.99999454929185days in the future
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471817 9837 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' for gc 6.99999454685037days in the future
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471911 9837 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2/runs/473a2313-0147-44ae-ab9c-b39f5a23be22' for gc 6.99999454636444days in the future
Jan 25 02:49:34 hostname mesos-slave[9817]: I0125 02:49:34.471997 9837 gc.cpp:56] Scheduling '/tmp/mesos/meta/slaves/20150512-155858-53586112-5050-11767-S0/frameworks/20150311-145345-20031680-5050-2698-0000/executors/TASKNAME.4b060055-b85a-11e5-8a34-52eb089dbeb2' for gc 6.99999454594963days in the future
One answer I found to someone's question with the same error suggested to check if it gets killed by the OOM killer, I checked and there is no out of memory problem, no relevant kernel log. The mesos-slave itself logs that is asks the framework to kill it so I don't think it's an outside process, correct me if I'm wrong.
I currently use:
Mesos: 0.21.1-1.2.debian77
Marathon: 0.8.0-1.1.97.debian77
Chronos: 2.3.2-0.1.20150207000917.debian77
I do know they are outdated, but this problem occurs for a long time seemingly at random times affecting random containers, and even if it occurs less in future releases I still bothered why a slave decides to kill a task without logging any reason...
If you need any more logs just ask which one to provide. I only included so little because that container was running for more than a day without any problem or error/warn log in mesos or stderr and suddenly the first line appeared in the log asking the slave to kill it.