How can a failed Kubernetes Ceph node be deleted automatically?

Question

In an environment with more than one node that uses Ceph block volumes in RWO mode, if a node fails (it is unreachable and will not come back soon) and the pod is rescheduled to another node, the pod can't start if it has a Ceph block PVC. The reason is that the volume is 'still being used' by the old pod: because the node failed, its resources can't be cleaned up properly.

If I remove the node from the cluster using kubectl delete node dead-node, the pod can start because the resources get removed.

How can I do this automatically? Some possibilities I have thought about are:

Can I set a force-detach timeout for the volume?
Can I set a node deletion timeout?
Can I automatically delete a node that has given taints?

I could use ReadWriteMany mode with other volume types so that the PV can be used by more than one pod, but that is not ideal.

Rico Rico · Accepted Answer · 2020-07-29T05:00:55

You can probably add a sidecar container and tweak the readiness and liveness probes in your pod so that the pod doesn't restart if the Ceph block volume is unreachable for some time from the container that is using it. (There may be other implications for your application, though.)

Something like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: ceph
  name: ceph-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
  - name: cephclient
    image: ceph
    volumeMounts:
    - name: ceph
      mountPath: /cephmountpoint
    livenessProbe:
      # ... something (the probe handler goes here)
      initialDelaySeconds: 5
      periodSeconds: 3600  # make this really long
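
Separately from the probe approach, the manual step from the question (kubectl delete node dead-node) could in principle be scripted. The following is only a rough sketch, not a recommended or tested setup: the function names not_ready_nodes and delete_not_ready_nodes are made up for illustration, and it assumes the default column layout of kubectl get nodes --no-headers (NAME STATUS ROLES AGE VERSION). Be careful with this idea: if a node is only partitioned from the control plane and not actually down, its pod may still have the volume mapped, so deleting the node too eagerly can let two writers touch the same RWO volume.

```shell
#!/bin/sh
# Sketch only: print the names of nodes whose STATUS column starts with
# "NotReady" (this also covers "NotReady,SchedulingDisabled").
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Hypothetical helper: delete every node currently reported as NotReady.
# A real setup would also wait for a grace period before deleting, so that
# nodes which are only briefly unreachable are left alone.
delete_not_ready_nodes() {
  kubectl get nodes --no-headers | not_ready_nodes | while read -r node; do
    echo "deleting node ${node}"
    kubectl delete node "${node}"
  done
}
```

Calling delete_not_ready_nodes periodically (for example from a cron job on a machine with cluster-admin credentials) would approximate the "delete node timeout" idea from the question.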

✌️☮️
