0 votes

Using HPA (Horizontal Pod Autoscaler) and Cluster Autoscaler on GKE, pods and nodes are scaled up as expected. However, when demand decreases, pods seem to be deleted from random nodes. This leaves nodes underutilized, which is not cost effective...

EDIT: The HPA is based on a single metric, targetCPUUtilizationPercentage. I am not using VPA.

This is a redacted YAML file for the Deployment and HPA:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: c1
        resources:
          requests:
            cpu: 200m
            memory: 1.2G
      - name: c2
        resources:
          requests:
            cpu: 10m
        volumeMounts:
        - name: log-share
          mountPath: /mnt/log-share
      - name: c3
        resources:
          requests:
            cpu: 10m
          limits:
            cpu: 100m
        volumeMounts:
        - name: log-share
          mountPath: /mnt/log-share
      volumes:
      - name: log-share
        emptyDir: {}

---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: foo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
  minReplicas: 1
  maxReplicas: 60
  targetCPUUtilizationPercentage: 80
...

EDIT2: Added an emptyDir volume to make this a valid example.

How do I improve this situation?

There are some ideas, but none of them solves the issue completely:

  • configure the node pool machine type and pod resource requests so that only one pod fits on a node. If a pod is removed from a node by the HPA, the node will eventually be deleted, but this doesn't work when deployments have varying resource requests.
  • use preemptible nodes if possible (a rough sketch of such a node pool follows this list)...
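
A minimal sketch of such a node pool, assuming a hypothetical pool name, cluster, zone, and machine type (none of these come from the question; the machine type would have to be sized against the pod's actual requests):

# Hypothetical small, preemptible, autoscaling node pool; all names are placeholders.
gcloud container node-pools create small-preemptible-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --machine-type=e2-small \
  --preemptible \
  --enable-autoscaling --min-nodes=0 --max-nodes=60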
Which metrics is the HPA based on: a single metric, or are you using multiple metrics to autoscale? How did you configure this HPA? Are you using only HPA, or also VPA? Could you share your HPA YAML? Did you specify requests/limits and replicas in your deployment? - PjoterS
The HPA is based on a single metric, targetCPUUtilizationPercentage. It is configured with YAML and kubectl apply. I am not using VPA. Yes, CPU and memory requests are specified and the value of replicas is 1. I edited the question. Thanks. - hiroshi
Ah, I noticed the emptyDir volume may cause trouble. github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/… - hiroshi
With cluster autoscaling, underutilized nodes (under 50% utilization) should scale down. Are you trying to maintain nearly full nodes? - Patrick W
I'm trying to reduce the number of nodes as much as theoretically possible. Say a node can hold 4 pods and three nodes have 2, 3, and 3 pods; I hope the pod on the 2-pod node can be moved to the other nodes so that node gets deleted. - hiroshi

1 Answer

2 votes

Sorry, I failed to mention the use of emptyDir (I edited the YAML in the question).

As I commented on the question myself, I found "What types of pods can prevent CA from removing a node?" in the Cluster Autoscaler FAQ:

Pods with local storage. *

An emptyDir volume counts as local storage, so I needed to add the following annotation to the pod template of the Deployment to mark the pods as safe to evict from less utilized nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
spec:
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      ...

After specifying the annotation, the size of the GCE instance group backing the GKE node pool became smaller than before. I think it worked!
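
As a rough sanity check (not part of the original answer), you can confirm the annotation actually reached the running pods and then watch the node count; this assumes the app=foo label from the manifest above:

# Dump the annotations of one pod; safe-to-evict should be among them.
kubectl get pods -l app=foo -o jsonpath='{.items[0].metadata.annotations}'

# Watch the node count shrink as the Cluster Autoscaler removes underutilized nodes.
kubectl get nodes --watch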


Thank you to everyone who commented on the question!