
We are running a 6-node K8s cluster. Two of the 6 nodes run RabbitMQ, Redis & Prometheus; we used a node selector and cordoned those nodes so that no other pods get scheduled on them.

On the remaining 4 nodes the application Pods run; we have around 18-19 microservices. For GKE there is an open issue in the official K8s repo regarding automatic scale-down: https://github.com/kubernetes/kubernetes/issues/69696#issuecomment-651741837. People there suggest the approach of setting a PodDisruptionBudget, and we tested that on Dev/Staging.

What we are looking for now is to pin Pods to a particular node pool that does not scale down, as we are running single replicas of some services.

As of now, we plan to apply node affinity to the services that run with a single replica and have no scaling requirement.

For the scalable services we won't specify any rule, so the default K8s scheduler will spread their Pods across different nodes. This way, if any node scales down, we don't face downtime for a service running a single replica.

Affinity example:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: do-not-scale
          operator: In
          values:
          - 'true'

We are planning to use the affinity type preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution. Note that the two have different schemas: preferred takes a list of weighted preference terms, while required takes nodeSelectorTerms.
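For comparison, a sketch of the stricter required variant, which uses nodeSelectorTerms rather than weighted preferences; the do-not-scale key is the same hypothetical node label used in the example above:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: do-not-scale   # hypothetical node label, as in the example above
          operator: In
          values:
          - 'true'
```

With required, a Pod stays Pending if no node matches the label, whereas preferred falls back to scheduling on any node.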

Note: K8s does not first create a new replica on another node during a node drain (scale-down of any node), as we are running single replicas with a RollingUpdate strategy and minAvailable: 25%.

Why: if a PodDisruptionBudget is not specified and we have a Deployment with one replica, the Pod will be terminated first, and only then will a new Pod be scheduled on another node.

To make sure the application stays available during the node-draining process, we have to specify a PodDisruptionBudget and create more replicas. If we have 1 Pod with minAvailable: 30%, the drain (scale-down) will be refused.
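A minimal PDB sketch for this scenario; the name example-pdb and the app: example selector are hypothetical and would need to match the Deployment's actual Pod labels:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # hypothetical name
spec:
  minAvailable: 30%        # rounded up to a whole number of Pods
  selector:
    matchLabels:
      app: example         # hypothetical label; must match your Deployment's Pods
```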

Please point out any mistake you see and suggest a better option.


1 Answer


First of all, defining a PodDisruptionBudget does not make much sense when you have only one replica. minAvailable expressed as a percentage is rounded up to an integer, as it represents the minimum number of Pods that need to be available at all times.
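The rounding can be illustrated with a small sketch; the helper name min_available_pods is just for illustration:

```python
import math

def min_available_pods(replicas: int, percent: int) -> int:
    """minAvailable given as a percentage is rounded UP to a whole Pod count."""
    return math.ceil(replicas * percent / 100)

print(min_available_pods(1, 30))   # -> 1: all 1/1 Pods must stay up, so eviction is blocked
print(min_available_pods(4, 25))   # -> 1: up to 3 of 4 Pods may be evicted
```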

Keep in mind that you have no guarantee of any high availability when launching single-replica Deployments.

Why: If PodDisruptionBudget is not specified and we have a deployment with one replica, the pod will be terminated and then a new pod will be scheduled on a new node.

If you didn't explicitly define maxUnavailable in your Deployment's spec, it defaults to 25%. Per the docs, for maxUnavailable the percentage is rounded down to an integer number of Pods (it is maxSurge that is rounded up), so with a single replica the default already resolves to 0 unavailable Pods during a rollout.

If we have 1 pod with minAvailable: 30% it will refuse to drain node (scaledown).

A single replica with minAvailable: 30% is rounded up to 1 anyway. 1/1 Pods must remain up and running, so the Pod cannot be evicted and the node cannot be drained in this case.

You can try the following solution; however, I'm not 100% sure it will work when your Pod is re-scheduled to another node due to its eviction from the one it is currently running on.

But if you re-create your Pod, e.g. because you update its image to a new version, you can guarantee that at least one replica stays up and running (the old Pod won't be deleted until the new one enters the Ready state) by setting maxUnavailable: 0. Setting it explicitly to zero removes any ambiguity: the old Pod cannot be deleted unless the new one becomes Ready. At the same time, maxSurge: 2 allows up to 2 replicas above the desired count to exist temporarily during the update.

Your Deployment definition may begin as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # old Pod is kept until the new one is Ready
      maxSurge: 2
  selector:
  ...

Compare it with this answer, provided by mdaniel, where I originally found it.