1
votes

I have read up on the Kubernetes docs but I'm unable to get a clear answer on my question. I'm using the official cluster-autoscaler.

  1. I have a deployment that specifies one replica should be running. When a pod is terminated (for example, was running on a node that is getting scaled-down) is the new pod scheduled before the termination begins or after the termination is done? The docs say that schedule happens when terminating, but don't mention at which phase.
  2. To achieve seamless node scale-down without disruption to any services, I would expect k8 to scale up pods to N+1 replicas (at this point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains, and then schedules any missing pods based on configurations. Is it possible to configure this behaviour or this is currently not possible to do?

From what I understand, seamless updates are easy with RollingUpdate strategy. I have not find the same "Rolling" strategy to be possible for scale-down.

EDIT

TL;DR I'm looking for HA on a) two+ replica deployment and b) one replica deployment

a) Can be achieved by using PDBs. Checkout Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer)

b) If you're okay with short disruption, PDB is the official way to go. If you need a workaround, my answer can be of inspiration.

3

3 Answers

3
votes

The scale down behavior can be configured with what is called a Disruption Budget

In your Deployment Manifest you can define maxUnavailable and minAvailable number of Pods during voluntary disruptions like draining nodes.

For how to do it, check out the K8s Documentation.

1
votes

Below are some insight, hope this will help :

  1. If you use a deployment, then the scheduler checks that you always have the desired number of replicas running. No less, no more. So when you kill a node (which have one of your replicas), the new pod will be scheduled after the termination of one of your original replicas. It's up to you to anticipate if it's a planified maintenance.

  2. If you have lots of nodes (meaning more than one) and want to achieve HA (high availability) for your deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official doc

0
votes

Hate to answer my own question, but an easy solution to high-availability service with only one pod (not wasting resources with running one idle replica) is to use PreStop hook (to make the action blocking if proper SIGTERM handling is not implemented) together with terminationGracePeriodSeconds with enough time for the other service to start.

Contradicting to what has been said here, the scheduling happens when pod is terminating. After quick testing (should have done that together with reading docs) where I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds.

By deleting the pod, it will enter the Terminating state and stay in that state for 240 seconds. Immediately after marking the pod as Terminating, new pod was scheduled instead of it. So the previous pod has time to finish whatever it is doing and the other one can seamlessly take its place.

I haven't tested how will the networking behave since LB will stop sending new requests, but I assume the downtime will be much lower than without the terminationGracePeriodSeconds set to a higher amount than the default.

Beware that is not official by any means but serves as a workaround for my use case.