9 votes

I have a multi-regional testing setup on GKE k8s 1.9.4. Every cluster has:

  • an ingress, configured with kubemci
  • 3 node pools with different node labels:
    • default-pool, label system (1 vCPU / 2 GB RAM)
    • frontend-pool, label frontend (2 vCPU / 2 GB RAM)
    • backend-pool, label backend (1 vCPU / 600 MB RAM)
  • an HPA scaling on a custom metric

So components like prometheus-operator, prometheus-server, custom-metrics-api-server and kube-state-metrics are pinned to nodes with the system label.

Frontend and backend pods are attached to nodes with the frontend and backend labels respectively (a single pod per node, enforced with podAntiAffinity).
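
Roughly, that anti-affinity rule looks like this (simplified sketch; the app: frontend label is just for illustration):

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: frontend   # app: backend for the backend deployment
        topologyKey: kubernetes.io/hostname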

After autoscaling scales backend or frontend pods down, the corresponding nodes stick around, because pods from the kube-system namespace (e.g. heapster) are still running on them. This leads to a situation where a node with the frontend / backend label stays alive after downscaling even though there's no backend or frontend pod left on it.

The question is: how can I avoid kube-system pods being created on the nodes that serve my application (if that is really sane and possible)?

I guess I should use taints and tolerations for the backend and frontend nodes, but how can that be combined with HPA and the in-cluster node autoscaler?


2 Answers

6 votes

Seems like taints and tolerations did the trick.

Create a cluster with a default node pool (for monitoring and kube-system):

gcloud container --project "my-project-id" clusters create "app-europe" \
  --zone "europe-west1-b" --username="admin" --cluster-version "1.9.4-gke.1" --machine-type "custom-2-4096" \
  --image-type "COS" --disk-size "10" --num-nodes "1" --network "default" --enable-cloud-logging --enable-cloud-monitoring \
  --maintenance-window "01:00" --node-labels=region=europe-west1,role=system

Create a node pool for your application:

gcloud container --project "my-project-id" node-pools create "frontend" \
      --cluster "app-europe" --zone "europe-west1-b" --machine-type "custom-2-2048" --image-type "COS" \
      --disk-size "10" --node-labels=region=europe-west1,role=frontend \
      --node-taints app=frontend:NoSchedule \
      --enable-autoscaling --num-nodes "1" --min-nodes="1" --max-nodes="3"

Then add nodeAffinity and tolerations sections to the pod template spec in your deployment manifest:

  tolerations:
  - key: "app"
    operator: "Equal"
    value: "frontend"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: beta.kubernetes.io/instance-type
            operator: In
            values:
            - custom-2-2048
        - matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
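
With the NoSchedule taint on the pool, kube-system pods without a matching toleration can no longer be scheduled on the frontend nodes, so once the frontend pods are scaled down the Cluster Autoscaler has nothing holding those nodes and can remove them.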

0 votes

The first thing I would recommend checking is that the resource requests in your PodSpec are enough to carry the load, and that there are enough resources on the system nodes to schedule all system pods.

You may try to prevent system pods from being scheduled onto the autoscaled frontend or backend nodes using either the simpler nodeSelector or the more flexible Node Affinity.

You can find a good explanation and examples in the document “Assigning Pods to Nodes”.
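
For instance, a minimal nodeSelector sketch that pins a monitoring component (the prometheus-server name is just illustrative) to the system pool, using the role=system label from the question:

  # in the pod template spec of e.g. the prometheus-server deployment
  nodeSelector:
    role: system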

Taints and tolerations are similar to Node Affinity, but more from the node's perspective: they allow a node to repel a set of pods. Check the document “Taints and Tolerations” if you choose this way.
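
As a rough illustration, a taint can be added to an existing node like this (for autoscaled pools, set it on the pool itself as in the answer above, so that new nodes inherit it):

kubectl taint nodes <node-name> app=frontend:NoSchedule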

When you create a node pool for autoscaling, you can add labels and taints, so they are applied to new nodes when the Cluster Autoscaler (CA) scales the pool up.

In addition to keeping system pods off the frontend/backend nodes, it would be a good idea to configure a PodDisruptionBudget and the autoscaler safe-to-evict option for pods that could otherwise prevent CA from removing a node during downscale.

According to the Cluster Autoscaler FAQ, there are several types of pods that may prevent CA from downscaling your cluster:

  • Pods with restrictive PodDisruptionBudget (PDB).
  • Kube-system pods that:
    • are not run on the node by default,
    • don't have PDB or their PDB is too restrictive (since CA 0.6).
  • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc).
  • Pods with local storage. *
  • Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc.).

*Unless the pod has the following annotation (supported in CA 1.0.3 or later):

"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"

Prior to version 0.6, Cluster Autoscaler was not touching nodes that were running important kube-system pods like DNS, Heapster, Dashboard etc.
If these pods landed on different nodes, CA could not scale the cluster down and the user could end up with a completely empty 3 node cluster.
In 0.6 an option was added to tell CA that some system pods can be moved around. If the user configures a PodDisruptionBudget for the kube-system pod, then the default strategy of not touching the node running this pod is overridden with the PDB settings.
So, to enable kube-system pod migration, one should set minAvailable to 0 (or <= N if there are N+1 pod replicas).
See also “I have a couple of nodes with low utilization, but they are not scaled down. Why?” in the same FAQ.
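
A minimal sketch of such a PDB for a kube-system pod (the k8s-app: heapster selector is an assumption; check the actual labels with kubectl get pods -n kube-system --show-labels):

  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: heapster-pdb
    namespace: kube-system
  spec:
    minAvailable: 0
    selector:
      matchLabels:
        k8s-app: heapster   # assumed label, verify on your cluster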

The Cluster Autoscaler FAQ can also help you choose the correct CA version for your cluster.

To get a better understanding of what lies under the hood of the Cluster Autoscaler, check the official documentation.