
I have a Kubernetes cluster in which we do not want pods to be evicted, because pod eviction causes a lot of side effects for the applications running on them.

To prevent pod eviction, we have configured all the pods with the Guaranteed QoS class. I know that even with this, pod eviction can happen if there is resource starvation in the system. We have monitors that alert us on resource starvation at both the pod and node level, so we find out well before a pod gets evicted and can take measures beforehand.
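For illustration, our pod specs all follow this shape, with requests equal to limits on every container, which is what gives them the Guaranteed QoS class (names and values below are placeholders, not our real workload):

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-pod                      # placeholder name
    spec:
      containers:
        - name: app
          image: example.com/app:1.0     # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "500m"                # limits equal to requests on every
              memory: "512Mi"            # container => QoS class "Guaranteed"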

The other reason pod eviction happens is when a node goes into the not-ready state: kube-controller-manager checks pod-eviction-timeout and evicts the node's pods once that timeout expires. We have a monitor that alerts us when a node goes not-ready. After that alert we want to run some clean-up on the application side so that the applications end gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout defaults to 5 minutes.

Is it fine to increase pod-eviction-timeout to 300m (5 hours)? What are the impacts of raising this timeout to such a value?

P.S.: I know that during this wait time, if the pod uses more resources, kubelet can itself evict the pod. I want to know what other impacts there are of waiting for such a long time.
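(For reference, by "kubelet can itself evict the pod" I mean kubelet's node-pressure eviction; its thresholds live in the KubeletConfiguration and look roughly like the snippet below. The values are only illustrative, not our actual settings.)

    # KubeletConfiguration snippet -- values are illustrative only
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
      memory.available: "200Mi"          # evict pods when free node memory drops below this
      nodefs.available: "10%"            # evict pods when the node filesystem is nearly full
    evictionSoft:
      memory.available: "500Mi"
    evictionSoftGracePeriod:
      memory.available: "1m30s"          # soft threshold must hold this long before eviction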

Why not just set the eviction thresholds to whatever your actual organizational tolerances are? As you said, the system only activates to protect itself under actual load spikes, meaning your resource limits are not correct. – coderanger
I just wanted to understand whether there is any impact of having a longer eviction timeout. I couldn't find any document that talks about best practices for pod-eviction-timeout, which is why I raised this question here. – Karthik

1 Answer


As @coderanger said, your limits are incorrect and that should be fixed instead of weakening the self-healing capabilities of Kubernetes.

If your pod dies, whatever the issue was, by default it will be rescheduled according to your configuration. If that is a problem for you, I would recommend revisiting your architecture and rewriting the app to use Kubernetes the way it is meant to be used:

  • if a pod keeps receiving requests while it is unresponsive, put a load balancer in front of it or queue the requests,
  • if pod IPs change after restarts, fix that by using DNS and a Service instead of connecting to pods directly (see the sketch after this list),
  • if your pod is being evicted, check why, and set proper requests and limits.
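As a sketch of the second point, a minimal Service gives the pods a stable DNS name and virtual IP, so clients never talk to pod IPs directly (the name, labels and ports below are placeholders):

    apiVersion: v1
    kind: Service
    metadata:
      name: app-svc                      # placeholder; resolvable as app-svc.<namespace>.svc.cluster.local
    spec:
      selector:
        app: my-app                      # must match the labels on your pods
      ports:
        - port: 80                       # port the Service exposes
          targetPort: 8080               # port the container actually listens on

Clients connect to the Service name, and kube-proxy keeps routing to whichever healthy pods currently match the selector, so restarts and rescheduling don't break them.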

As for the node, there is a really nice blog post, Improving Kubernetes reliability: quicker detection of a Node down. It is the opposite of what you are thinking of doing, but it also explains why the default 340s (roughly 40s for the node to be marked not-ready plus the 5m pod-eviction-timeout) is already too long:

    Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s

    This is a very important timeout; by default it is 5m, which in my opinion is too high, because even though the node is already marked as unhealthy the kube controller manager won't remove the pods, so they will still be reachable through their Service and requests will fail.

If you still want to raise the defaults, these are the relevant settings (a sketch of where they are set follows the list):

  • kubelet: --node-status-update-frequency (default 10s)
  • controller-manager: --node-monitor-period (default 5s)
  • controller-manager: --node-monitor-grace-period (default 40s)
  • controller-manager: --pod-eviction-timeout (default 5m)
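As a rough sketch of where the controller-manager flags live on a kubeadm-style control plane (the manifest path, image tag and raised values below are assumptions, not recommendations), you would edit the static pod manifest on the control-plane node:

    # /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm layout, assumed)
    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-controller-manager
      namespace: kube-system
    spec:
      containers:
        - name: kube-controller-manager
          image: registry.k8s.io/kube-controller-manager:v1.21.0   # example tag
          command:
            - kube-controller-manager
            - --node-monitor-period=5s
            - --node-monitor-grace-period=2m       # example: raised from the 40s default
            - --pod-eviction-timeout=1h            # example: raised from the 5m default
            # ...keep the rest of the flags kubeadm generated

The kubelet-side --node-status-update-frequency is configured per node, either as a kubelet flag or as nodeStatusUpdateFrequency in the KubeletConfiguration, and the blog post above explains how these values interact.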

If you provide more details, I'll try to help more.