I have a Kubernetes cluster set up with kubeadm init (mostly defaults). Everything works as intended, except that if one of my nodes goes offline while pods are running on it, those pods stay in the Running status indefinitely. From what I've read, they should transition to the Unknown or Failed status, and after --pod-eviction-timeout (default 5m) they should be rescheduled to another healthy node.
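For reference, here is how I understand the eviction-related controller-manager settings can be inspected on a kubeadm cluster (a sketch, assuming the default kubeadm static-pod manifest path; I have not customized any of these flags):
# run on the master node; no matches means the compiled-in defaults are in use
# (node-monitor-grace-period=40s, pod-eviction-timeout=5m0s)
grep -E 'pod-eviction-timeout|node-monitor-grace-period|node-monitor-period' \
  /etc/kubernetes/manifests/kube-controller-manager.yaml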
Here are my pods after 20+ minutes of node7 being offline (I've also left it for over two days once with no reschedule):
kubectl get pods -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP         NODE    NOMINATED NODE   READINESS GATES
workshop-30000-77b95f456c-sxkp5      1/1     Running   0          20m   REDACTED   node7   <none>           <none>
workshop-operator-657b45b6b8-hrcxr   2/2     Running   0          23m   REDACTED   node7   <none>           <none>
kubectl get deployments -o wide
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS         IMAGES     SELECTOR
deployment.apps/workshop-30000      0/1     1            0           21m   workshop-ubuntu    REDACTED   client=30000
deployment.apps/workshop-operator   0/1     1            0           17h   ansible,operator   REDACTED   name=workshop-operator
You can see the pods are still flagged as Running, whereas their deployments show READY 0/1.
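Since eviction on 1.17 is driven by the NoExecute node taints plus each pod's tolerationSeconds (which the DefaultTolerationSeconds admission plugin should set to 300s by default), the tolerations on one of the stuck pods can be checked like this (a sketch, using the pod and namespace from the output above and below):
kubectl get pod workshop-30000-77b95f456c-sxkp5 -n workshop-operator \
  -o jsonpath='{.spec.tolerations}'
# expect tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable
# with tolerationSeconds: 300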
Here are my nodes:
kubectl get nodes -o wide
NAME                STATUS     ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
kubernetes-master   Ready      master   34d    v1.17.3   REDACTED      <none>        Ubuntu 19.10   5.3.0-42-generic   docker://19.3.2
kubernetes-worker   NotReady   <none>   34d    v1.17.3   REDACTED      <none>        Ubuntu 19.10   5.3.0-29-generic   docker://19.3.2
node3               NotReady   worker   21d    v1.17.3   REDACTED      <none>        Ubuntu 19.10   5.3.0-40-generic   docker://19.3.2
node4               Ready      <none>   19d    v1.17.3   REDACTED      <none>        Ubuntu 19.10   5.3.0-40-generic   docker://19.3.2
node6               NotReady   <none>   5d7h   v1.17.4   REDACTED      <none>        Ubuntu 19.10   5.3.0-42-generic   docker://19.3.6
node7               NotReady   <none>   5d6h   v1.17.4   REDACTED      <none>        Ubuntu 19.10   5.3.0-42-generic   docker://19.3.6
What could the issue be? All my containers have readiness and liveness probes. I've tried searching through the docs and elsewhere, but haven't been able to find anything that solves this.
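One thing I believe is worth verifying is whether the node lifecycle controller has actually tainted the dead node, since (as I understand it) no taint means no eviction is ever triggered. A sketch, using node7 from above:
kubectl get node node7 -o jsonpath='{.spec.taints}'
# a dead node should carry node.kubernetes.io/unreachable with effect NoExecute
# (plus NoSchedule); if that taint is missing, the controller-manager never marked the node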
Currently, if a node goes down, the only way I can get the pods that were on it rescheduled to a live node is to manually delete them with --force and --grace-period=0 (see below), which defeats some of the main goals of Kubernetes: automation and self-healing.
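For completeness, the manual workaround I mean looks like this (using one of the stuck pods above):
kubectl delete pod workshop-30000-77b95f456c-sxkp5 -n workshop-operator \
  --force --grace-period=0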
According to the docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifetime
If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed.
---------- Extra information ---------------
kubectl describe pods workshop-30000-77b95f456c-sxkp5
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  <unknown>          default-scheduler  Successfully assigned workshop-operator/workshop-30000-77b95f456c-sxkp5 to node7
  Normal   Pulling    37m                kubelet, node7     Pulling image "REDACTED"
  Normal   Pulled     37m                kubelet, node7     Successfully pulled image "REDACTED"
  Normal   Created    37m                kubelet, node7     Created container workshop-ubuntu
  Normal   Started    37m                kubelet, node7     Started container workshop-ubuntu
  Warning  Unhealthy  36m (x2 over 36m)  kubelet, node7     Liveness probe failed: Get http://REDACTED:8080/healthz: dial tcp REDACTED:8000: connect: connection refused
  Warning  Unhealthy  36m (x3 over 36m)  kubelet, node7     Readiness probe failed: Get http://REDACTED:8000/readyz: dial tcp REDACTED:8000: connect: connection refused
I believe those liveness and readiness probe failures were just due to a slow start. It seems liveness/readiness is no longer being checked after the node goes down (the last check was 37 minutes ago), which makes sense since the probes are run by the kubelet on the now-unreachable node.
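In case it helps with diagnosis, the node lifecycle and eviction decisions should show up in the controller-manager logs; on kubeadm it runs as a static pod, so something like this should work (a sketch, assuming the standard component=kube-controller-manager label kubeadm applies):
kubectl -n kube-system logs -l component=kube-controller-manager --tail=200 | grep -i node7
# look for messages about node7 becoming NotReady/unreachable and taints being added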
This is a self-hosted cluster with the following versions:
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Thanks to all who help.
EDIT: It was either a bug or a potential misconfiguration from when the cluster was initially bootstrapped with kubeadm. A full reinstall of the Kubernetes cluster and an upgrade from 1.17.4 to 1.18 solved the problem, and pods are now rescheduled off dead nodes.