3 votes

For some reason Kubernetes 1.6.2 does not trigger a cluster autoscaler scale-up on Google Container Engine.

I have a someservice Deployment with the following resources and rolling-update strategy:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: someservice
  labels:
    layer: backend
spec:
  minReadySeconds: 160
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate  
  template:
    metadata:
      labels:
        name: someservice
        layer: backend        
    spec:
      containers:
      - name: someservice
        image: eu.gcr.io/XXXXXX/someservice:v1
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 2
            memory: 20Gi        
          requests:
            cpu: 400m
            memory: 18Gi
     <.....>
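Note that with maxUnavailable: 0 and maxSurge: 100%, a rollout has to schedule the new pod (another 400m CPU / 18Gi memory of requests) next to the still-running old pod before the old one is torn down, so this is exactly the situation where I would expect the cluster autoscaler to add a node. The rollout itself is just an image tag bump, roughly like this (v2 stands in for the new tag):

$ kubectl -n dev set image deployment/someservice someservice=eu.gcr.io/XXXXXX/someservice:v2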

After changing the image version, the new pod cannot be scheduled:

$ kubectl -n dev get pods -l name=someservice
NAME                      READY     STATUS    RESTARTS   AGE
someservice-2595684989-h8c5d   0/1       Pending   0          42m
someservice-804061866-f2trc    1/1       Running   0          1h

$ kubectl -n dev describe pod someservice-2595684989-h8c5d

Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  43m       43m     4   default-scheduler           Warning     FailedScheduling    No nodes are available that match all of the following predicates:: Insufficient cpu (4), Insufficient memory (3).
  43m       42m     6   default-scheduler           Warning     FailedScheduling    No nodes are available that match all of the following predicates:: Insufficient cpu (3), Insufficient memory (3).
  41m       41m     2   default-scheduler           Warning     FailedScheduling    No nodes are available that match all of the following predicates:: Insufficient cpu (2), Insufficient memory (3).
  40m       36s     136 default-scheduler           Warning     FailedScheduling    No nodes are available that match all of the following predicates:: Insufficient cpu (1), Insufficient memory (3).
  43m       2s      243 cluster-autoscaler          Normal      NotTriggerScaleUp   pod didn't trigger scale-up (it wouldn't fit if a new node is added)

My node pool is set to autoscale with min: 2, max: 5, and the machines in the pool (n1-highmem-8, 52 GB of memory each) are large enough to accommodate this service. But somehow nothing happens:

$ kubectl get nodes
NAME                                 STATUS    AGE       VERSION
gke-dev-default-pool-efca0068-4qq1   Ready     2d        v1.6.2
gke-dev-default-pool-efca0068-597s   Ready     2d        v1.6.2
gke-dev-default-pool-efca0068-6srl   Ready     2d        v1.6.2
gke-dev-default-pool-efca0068-hb1z   Ready     2d        v1.6.2

$ kubectl  describe nodes | grep -A 4 'Allocated resources'
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests     Memory Limits
  ------------  ----------  ---------------     -------------
  7060m (88%)   15510m (193%)   39238591744 (71%)   48582818048 (88%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests Memory Limits
  ------------  ----------  --------------- -------------
  6330m (79%)   22200m (277%)   48930Mi (93%)   66344Mi (126%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests Memory Limits
  ------------  ----------  --------------- -------------
  7360m (92%)   13200m (165%)   49046Mi (93%)   44518Mi (85%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests     Memory Limits
  ------------  ----------  ---------------     -------------
  7988m (99%)   11538m (144%)   32967256Ki (61%)    21690968Ki (40%)
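For reference, what the scheduler actually compares against is node allocatable (capacity minus system reservations), not the raw 52 GB machine size; it can be pulled out like this:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory

A fresh n1-highmem-8 should still have well over 18Gi allocatable, so a single 400m / 18Gi pod ought to fit comfortably on a newly added node.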

$ gcloud container node-pools describe  default-pool --cluster=dev
autoscaling:
  enabled: true
  maxNodeCount: 5
  minNodeCount: 2
config:
  diskSizeGb: 100
  imageType: COS
  machineType: n1-highmem-8
  oauthScopes:
  - https://www.googleapis.com/auth/compute
  - https://www.googleapis.com/auth/datastore
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/devstorage.read_write
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/sqlservice
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  serviceAccount: default
initialNodeCount: 2
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/XXXXXX/zones/europe-west1-b/instanceGroupManagers/gke-dev-default-pool-efca0068-grp
management:
  autoRepair: true
name: default-pool
selfLink: https://container.googleapis.com/v1/projects/XXXXXX/zones/europe-west1-b/clusters/dev/nodePools/default-pool
status: RUNNING
version: 1.6.2

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:33:11Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

2 Answers

1 vote

So it seems that this is a bug in Kubernetes 1.6.2. According to a GKE support engineer:

From the messages "No nodes are available that match all of the following predicates", this seems to be a known issue and the engineers managed to track down the root cause. It was an issue in cluster autoscaler version 0.5.1 that is currently used in GKE 1.6 (up to 1.6.2). This issue had been fixed already in cluster autoscaler 0.5.2, which is included in head for the 1.6 branch.
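In practice the fix is to move to a 1.6 patch release newer than 1.6.2 once it ships with the updated autoscaler. Roughly like this (1.6.4 below is only a placeholder; pick whichever patch version get-server-config actually offers for the zone):

$ gcloud container get-server-config --zone europe-west1-b
$ gcloud container clusters upgrade dev --master --cluster-version 1.6.4 --zone europe-west1-b

The --master flag matters here because the cluster autoscaler runs as part of the GKE-managed master, so upgrading the nodes alone does not pick up the fix.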

1 vote

Make sure the GCE instance group autoscaler is either disabled or configured with appropriate minimum/maximum instance counts.

According to the Kubernetes Cluster Autoscaler FAQ:

CPU-based (or any metric-based) cluster/node group autoscalers, like GCE Instance Group Autoscaler, are NOT compatible with [Kubernetes Cluster Autoscaler]. They are also not particularly suited to use with Kubernetes in general.

...so it should probably be disabled.

Try:

gcloud compute instance-groups managed describe gke-dev-default-pool-efca0068-grp \
    --zone europe-west1-b

Then check the autoscaler property; it will be absent if the instance group autoscaler is disabled.

To disable it, do:

gcloud compute instance-groups managed stop-autoscaling gke-dev-default-pool-efca0068-grp \
    --zone europe-west1-b
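
For a quick overview across the project, the managed instance group listing also shows whether a GCE autoscaler is attached; the AUTOSCALED column should read "no" (or be empty) for node pool groups that are left to the Kubernetes cluster autoscaler:

$ gcloud compute instance-groups managed list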