GKE Kubernetes Node Pool Upgrade very slow

Question

I am experimenting with GKE cluster upgrades in a 6 nodes (in two node pools) test cluster before I try it on our staging or production cluster. Upgrading when I only had a 12 replicas nginx deployment, the nginx ingress controller and cert-manager (as helm chart) installed took 10 minutes per node pool (3 nodes). I was very satisfied. I decided to try again with something that looks more like our setup. I removed the nginx deploy and added 2 node.js deployments, the following helm charts: mongodb-0.4.27, mcrouter-0.1.0 (as a statefulset), redis-ha-2.0.0, and my own www-redirect-0.0.1 chart (simple nginx which does redirect). The problem seems to be with mcrouter. Once the node starts draining, the status of that node changes to Ready,SchedulingDisabled (which seems normal) but the following pods remains:

mcrouter-memcached-0
fluentd-gcp-v2.0.9-4f87t
kube-proxy-gke-test-upgrade-cluster-default-pool-74f8edac-wblf

I do not know why those two kube-system pods remains, but that mcrouter is mine and it won't go quickly enough. If I wait long enough (1 hour+) then it eventually work, I am not sure why. The current node pool (of 3 nodes) started upgrading 2h46 minutes ago and 2 nodes are upgraded, the 3rd one is still upgrading but nothing is moving... I presume it will complete in the next 1-2 hours... I tried to run the drain command with --ignore-daemonsets --force but it told me it was already drained. I tried to delete the pods, but they just come back and the upgrade does not move any faster. Any thoughts?

Update #1

The mcrouter helm chart was installed like this:

helm install stable/mcrouter --name mcrouter --set controller=statefulset

The statefulsets it created for mcrouter part is:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-mcrouter
    chart: mcrouter-0.1.0
    heritage: Tiller
    release: mcrouter
  name: mcrouter-mcrouter
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-mcrouter
      chart: mcrouter-0.1.0
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-mcrouter
  template:
    metadata:
      labels:
        app: mcrouter-mcrouter
        chart: mcrouter-0.1.0
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mcrouter-mcrouter
                release: mcrouter
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -p 5000
        - --config-file=/etc/mcrouter/config.json
        command:
        - mcrouter
        image: jphalip/mcrouter:0.36.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: mcrouter-port
          timeoutSeconds: 5
        name: mcrouter-mcrouter
        ports:
        - containerPort: 5000
          name: mcrouter-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: mcrouter-port
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 256m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mcrouter
          name: config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: mcrouter-mcrouter
        name: config
  updateStrategy:
    type: OnDelete

and here is the memcached statefulset:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: mcrouter-memcached
    chart: memcached-1.2.1
    heritage: Tiller
    release: mcrouter
  name: mcrouter-memcached
spec:
  podManagementPolicy: OrderedReady
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: mcrouter-memcached
      chart: memcached-1.2.1
      heritage: Tiller
      release: mcrouter
  serviceName: mcrouter-memcached
  template:
    metadata:
      labels:
        app: mcrouter-memcached
        chart: memcached-1.2.1
        heritage: Tiller
        release: mcrouter
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: mcrouter-memcached
                release: mcrouter
            topologyKey: kubernetes.io/hostname
      containers:
      - command:
        - memcached
        - -m 64
        - -o
        - modern
        - -v
        image: memcached:1.4.36-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: memcache
          timeoutSeconds: 5
        name: mcrouter-memcached
        ports:
        - containerPort: 11211
          name: memcache
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: memcache
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: OnDelete
status:
  replicas: 0

Can you share a configuration of your mcrouter-memcached-0 statefulset deployment? — Anton Kostenko
@AntonKostenko Sure thing! I updated my question above. Thanks for taking a look :) — roychri
@AntonKostenko The helm chart created 2 statefulsets, the one you asked for is the second one, yes. Thanks :) — roychri
@AntonKostenko Oh, I get what you mean now, I fixed my question! Thanks! :) — roychri

Anton Kostenko Anton Kostenko · Accepted Answer · 2018-03-21T19:39:41

That is a bit complex question and I am definitely not sure that it is like how I thinking, but... Let's try to understand what is happening.

You have an upgrade process and have 6 nodes in the cluster. The system will upgrade it one by one using Drain to remove all workload from the pod.

Drain process itself respecting your settings and number of replicas and desired state of workload has higher priority than the drain of the node itself.

During the drain process, Kubernetes will try to schedule all your workload on resources where scheduling available. Scheduling on a node which system want to drain is disabled, you can see it in its state - Ready,SchedulingDisabled.

So, Kubernetes scheduler trying to find a right place for your workload on all available nodes. It will wait as long as it needs to place everything you describe in a cluster configuration.

Now the most important thing. You set that you need replicas: 5 for your mcrouter-memcached. It cannot run more than one replica per node because of podAntiAffinity and a node for a running it should have enough resources for that, which is calculated using resources: block of ReplicaSet.

So, I think, that your cluster just does not has enough resource for a run new replica of mcrouter-memcached on the remaining 5 nodes. As an example, on the last node where a replica of it still not running, you have not enough memory because of other workloads.

I think if you will set replicaset for mcrouter-memcached to 4, it will solve a problem. Or you can try to use a bit more powerful instances for that workload, or add one more node to the cluster, it also should help.

Hope I gave enough explanation of my logic, ask me if something not clear to you. But first please try to solve an issue by provided solution:)

GKE Kubernetes Node Pool Upgrade very slow

Update #1

2 Answers