I have readiness probes configured on several pods (which are members of deployment-managed replica sets). They work as expected -- readiness is required as part of the deployment's rollout strategy, and if a healthy pod becomes NotReady, the associated Service will remove it from the pool of endpoints until it becomes Ready again.
Furthermore, I have external health checking (using Sensu) that alerts me when a pod becomes NotReady.
Sometimes a pod reports NotReady for an extended period and shows no sign of recovery. I would like to configure things so that a pod that stays NotReady beyond some threshold gets evicted from its node and rescheduled elsewhere. I'd settle for a mechanism that simply kills the container (so that it is restarted in the same pod), but what I really want is for the pod to be evicted and rescheduled.
I can't seem to find anything that does this. Does it exist? Most of my searching turns up things about evicting pods from NotReady nodes, which is not what I'm looking for at all.
If this doesn't exist, why? Is there some other mechanism I should be using to accomplish something equivalent?
EDIT: I should have specified that I also have liveness probes configured and working the way I want. In the scenario I’m talking about, the pods are “alive.” My liveness probe is configured to detect more severe failures and restart accordingly and is working as desired.
I’m really looking for the ability to evict based on a pod being live but not ready for an extended period of time.
I guess what I really want is the ability to configure an arbitrary number of probes, each checking a different expectation and each taking a different action once its failure threshold is reached. As it is now, liveness failures have only one recourse (restart the container), and readiness failures also have just one (take the pod out of the Service endpoints and wait). I want to be able to configure any action.
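To make the request concrete, here is a minimal Python sketch of the check described above: find pods that are Running but have had a Ready condition of False for longer than some threshold. It is an illustration under assumptions, not an existing Kubernetes feature. The function names `parse_k8s_time` and `stale_not_ready_pods` are hypothetical, and the input is assumed to be the parsed JSON from `kubectl get pods -o json`, relying on each pod's `status.conditions` entries (`type`, `status`, `lastTransitionTime`).

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes UTC timestamp such as 2024-01-01T12:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def stale_not_ready_pods(pod_list, max_not_ready, now=None):
    """Return names of Running pods that have been NotReady too long.

    pod_list      -- dict shaped like `kubectl get pods -o json` output (assumed)
    max_not_ready -- timedelta; the tolerated NotReady duration (hypothetical knob)
    now           -- optional aware datetime, injectable for testing
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Only consider pods whose containers are up ("live but not ready").
        if status.get("phase") != "Running":
            continue
        for cond in status.get("conditions", []):
            if cond.get("type") != "Ready" or cond.get("status") != "False":
                continue
            # lastTransitionTime marks when the pod last flipped to NotReady.
            since = parse_k8s_time(cond["lastTransitionTime"])
            if now - since > max_not_ready:
                stale.append(pod["metadata"]["name"])
    return stale
```

An external job could feed this the output of `kubectl get pods -o json` and run `kubectl delete pod` on each returned name. Since the pods belong to a deployment-managed replica set, a replacement would be created, though nothing guarantees it lands on a different node.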
[...] restartPolicy of Always or OnFailure? – apisim
[...] livenessProbe.exec.command. – apisim