We're sort of new to the whole Kubernetes world, but by now have a number of services running in GKE. Today we saw some strange behaviour, though: one of the processes running inside one of our pods was killed, even though the Pod itself had plenty of resources available and wasn't anywhere near its limits.
The limits are defined as such:
resources:
  requests:
    cpu: 100m
    memory: 500Mi
  limits:
    cpu: 1000m
    memory: 1500Mi
Inside the pod, a Celery (Python) worker is running, and this particular one is consuming some fairly long-running tasks.
During operation of one of the tasks, the celery process was suddenly killed, seemingly caused by OOM. The GKE Cluster Operations logs show the following:
Memory cgroup out of memory: Kill process 613560 (celery) score 1959 or sacrifice child
Killed process 613560 (celery) total-vm:1764532kB, anon-rss:1481176kB, file-rss:13436kB, shmem-rss:0kB
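For reference, here is a quick back-of-the-envelope conversion of the kernel's figures into the units used by the limit (assuming the log's kB values are KiB and that 1500Mi means 1500 MiB):

# Compare the OOM-kill log values against the Pod's memory limit.
# All figures are taken from the kernel log lines above.
limit_kib = 1500 * 1024      # limits.memory: 1500Mi -> 1,536,000 KiB
anon_rss_kib = 1481176       # anon-rss of the killed celery process
file_rss_kib = 13436         # file-rss of the killed celery process

rss_kib = anon_rss_kib + file_rss_kib
print(f"limit       : {limit_kib} KiB (~{limit_kib / 1024:.0f} MiB)")
print(f"process RSS : {rss_kib} KiB (~{rss_kib / 1024:.0f} MiB)")
print(f"gap to limit: ~{(limit_kib - rss_kib) / 1024:.0f} MiB")

So the process's RSS alone was within roughly 40 MiB of the 1500Mi limit at the moment of the kill (much closer than the graph suggests), and the memory cgroup also counts page cache against the same limit.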
The resource graph for the time period looks like the following:
As can be clearly seen, neither the CPU nor the memory usage was anywhere near the limits defined for the Pod, so we're baffled as to why any OOMKilling occurred. We're also a bit baffled by the fact that the process itself was killed, and not the actual Pod.
Is this particular OOM actually happening inside the OS and not in Kubernetes? And if so, is there a way to get around this particular problem?
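A minimal sketch for checking the cgroup's own view from inside the container (the paths assume cgroup v1; under cgroup v2 the equivalent files are memory.max and memory.current):

# Read the memory limit and current usage as the memory cgroup sees them,
# from inside the container. Paths assume cgroup v1.
def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

limit_bytes = read_int("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage_bytes = read_int("/sys/fs/cgroup/memory/memory.usage_in_bytes")

print(f"cgroup limit : {limit_bytes / 2**20:.0f} MiB")
print(f"cgroup usage : {usage_bytes / 2**20:.0f} MiB")

If that limit matches the 1500Mi from the Pod spec, that would help confirm whether the kill is coming from the kernel's cgroup OOM killer on the node rather than from Kubernetes itself.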
Update: kubectl top pods [name]
shows a significantly higher number while the process is running, and this number actually touches the memory limit. I checked our Elastic monitoring solution, and it shows the same values as the GKE dashboard. – Tingyou
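For anyone wanting to reproduce that observation, a crude sampling loop along these lines (the pod name is a placeholder) makes the short-lived spike visible even when the dashboard graph smooths it over:

# Poll kubectl top on a short interval to catch memory spikes that a
# coarsely sampled dashboard graph can miss. POD_NAME is a placeholder.
import subprocess
import time

POD_NAME = "my-celery-pod"

while True:
    result = subprocess.run(
        ["kubectl", "top", "pod", POD_NAME, "--no-headers"],
        capture_output=True, text=True,
    )
    # kubectl top prints: NAME  CPU(cores)  MEMORY(bytes)
    print(time.strftime("%H:%M:%S"), result.stdout.strip() or result.stderr.strip())
    time.sleep(5)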