2 votes

I have a deployment with 2 replicas. I would like to specify that, when possible, the pods should be spread across as many nodes/hostnames as possible. So far, I have the following spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: topspin-apollo-backend-staging-dep
  labels:
    app: topspin-apollo-backend
    env: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: topspin-apollo-backend
      env: staging
  template:
    metadata:
      labels:
        app: topspin-apollo-backend
        env: staging
    spec:
      containers:
        - name: topspin-apollo-backend
          image: rwu1997/topspin-apollo-backend:latest
          imagePullPolicy: Always
          envFrom:
            - secretRef:
                name: topspin-apollo-backend-staging-secrets
          ports:
            - containerPort: 8000
      imagePullSecrets:
        - name: regcred
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: topspin-apollo-backend
                    env: staging
                topologyKey: "kubernetes.io/hostname"

If I kubectl apply this deployment from scratch, k8s correctly schedules a pod on each of the 2 nodes in the cluster (A and B). If I kill one of the nodes, say B, the corresponding pod is re-scheduled on the last remaining node, A (as expected).

When I then add another node C to the cluster, the two pods remain scheduled on node A. This is expected, as far as I know.

Is there a way to trigger the scheduler to re-balance the 2 pods across nodes A and C?

I've tried kubectl scale --replicas=4, which schedules two additional pods on node C, followed by kubectl scale --replicas=2, but the scale-down seems to kill off the 2 most recently scheduled pods (instead of prioritizing the pod anti-affinity).
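Concretely, the sequence was (using the deployment name from the spec above):

   kubectl scale deployment topspin-apollo-backend-staging-dep --replicas=4
   # the two new pods get scheduled on node C
   kubectl scale deployment topspin-apollo-backend-staging-dep --replicas=2
   # the scale-down removes the two most recently created pods,
   # i.e. the ones on node C, leaving both original pods on node A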

One method that works is to kubectl delete the deployment and then kubectl apply it again, but this introduces downtime.

Another method is kubectl scale --replicas=1 followed by kubectl scale --replicas=2, but it's less than ideal since only 1 replica exists for a period of time.

you could delete the replicaset – DevLounge

2 Answers

4 votes

I think you need to use the descheduler. Among a few other use cases, one thing it helps with is scenarios where pod/node affinity requirements are no longer satisfied due to changes in the cluster.

The descheduler can be run as a Job or CronJob inside a k8s cluster. It supports a few strategies, and in your case the RemoveDuplicates strategy should be helpful. It's fairly straightforward to use; have a look at the docs and the example configuration: https://github.com/kubernetes-sigs/descheduler#removeduplicates
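A minimal policy sketch based on that README (the policy API has changed across descheduler releases, so check the format for the version you deploy; the older v1alpha1 form is shown here):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true

The policy typically lives in a ConfigMap mounted by the descheduler Job/CronJob. When it runs, one of the duplicate pods on node A is evicted, the Deployment creates a replacement, and the pod anti-affinity preference then places it on node C.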

0 votes

I also faced this scenario, as well as another one where I sometimes want to move a single pod to another node (due to resource requirements). I normally follow this:

   kubectl cordon <NODE>
   # delete the pod running on this NODE
   kubectl delete pod <pod-name>
   # once the replacement pod is running on another node:
   kubectl uncordon <NODE>

For single-replica deployments, I scale the deployment up after cordoning the node to avoid any downtime for that service. You can try this.
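Roughly, the sequence looks like this (a sketch; <NODE>, <deployment-name>, <app-label> and <pod-on-NODE> are placeholders for your own names):

   kubectl cordon <NODE>
   # scale up so a second pod comes up on another (schedulable) node
   kubectl scale deployment <deployment-name> --replicas=2
   kubectl wait --for=condition=Ready pod -l app=<app-label>
   # remove the old pod on the cordoned node; scaling straight back down
   # would tend to remove the newer pod instead, so delete it explicitly first
   kubectl delete pod <pod-on-NODE>
   kubectl scale deployment <deployment-name> --replicas=1
   kubectl uncordon <NODE>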