5
votes

I have a cluster running on GCP that currently consists entirely of preemptible nodes. We're experiencing issues where kube-dns becomes unavailable (presumably because a node has been preempted). We'd like to improve the resilience of DNS by moving kube-dns pods to more stable nodes.

Is it possible to schedule cluster-critical system pods like kube-dns (or all pods in the kube-system namespace) on a node pool of only non-preemptible nodes? I'm wary of using affinity, anti-affinity, or taints, since these pods are auto-created at cluster bootstrapping and any changes made could be clobbered by a Kubernetes version upgrade. Is there a way to do this that will persist across upgrades?

1 Answer

4
votes

The solution was to use taints and tolerations in conjunction with node affinity. We created a second, non-preemptible node pool and added a taint to the preemptible pool.

Terraform config:

resource "google_container_node_pool" "preemptible_worker_pool" {
  node_config {
    ...
    preemptible     = true

    labels = {
      preemptible = "true"
      dedicated   = "preemptible-worker-pool"
    }

    taint {
      key    = "dedicated"
      value  = "preemptible-worker-pool"
      effect = "NO_SCHEDULE"
    }
  }
}
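
The second, non-preemptible pool is not shown above. A minimal sketch of what it might look like, assuming the same provider; the resource name, pool name, cluster name, node count, and label value are hypothetical placeholders rather than values from our setup:

```hcl
# Hypothetical sketch of the stable pool. All names and values below are
# placeholders for illustration, not taken from the real configuration.
resource "google_container_node_pool" "stable_pool" {
  name       = "stable-pool" # hypothetical
  cluster    = "my-cluster"  # hypothetical
  node_count = 1             # hypothetical

  node_config {
    preemptible = false

    labels = {
      preemptible = "false"
    }

    # Deliberately no taint block: pods without a toleration for
    # "dedicated=preemptible-worker-pool:NoSchedule" can only be
    # scheduled onto the nodes of this pool.
  }
}
```

The key design point is the absence of a taint here: the stable pool accepts any pod, while the preemptible pool only accepts pods that explicitly tolerate its taint.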

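We then added a toleration and nodeAffinity to our existing workloads: the toleration allows them to run on the tainted node pool, and the affinity requires them to. The cluster-critical pods carry no such toleration, so they are effectively forced onto the untainted (non-preemptible) node pool. Since nothing in the kube-system namespace is modified, there is nothing for an upgrade to clobber.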

Kubernetes config:

spec:
  template:
    spec:
      # The affinity + tolerations sections together allow and enforce that the workers are
      # run on dedicated nodes tainted with "dedicated=preemptible-worker-pool:NoSchedule".
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dedicated
                operator: In
                values:
                - preemptible-worker-pool
      tolerations:
      - key: dedicated
        operator: "Equal"
        value: preemptible-worker-pool
        effect: "NoSchedule"