
Let me preface this by saying this is running on a production cluster, so any 'destructive' solution that would cause downtime is not an option (unless absolutely necessary).

My environment

I have a Kubernetes cluster (11 nodes, 3 of which are master nodes) running v1.13.1 on AWS. This cluster was created via kOps like so:

kops create cluster \
    --yes \
    --authorization RBAC \
    --cloud aws \
    --networking calico \
    ...

I don't think this is relevant, but everything on the cluster has been installed via Helm 3.

Here are my exact versions:

$ helm version
version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.5"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:31:33Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
$ kops version
Version 1.18.2
$ kubectl get nodes                                                                                                                                                                            
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-2-147-44.ec2.internal    Ready    node     47h   v1.13.1
ip-10-2-149-115.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-150-124.ec2.internal   Ready    master   2d    v1.13.1
ip-10-2-151-33.ec2.internal    Ready    node     47h   v1.13.1
ip-10-2-167-145.ec2.internal   Ready    master   43h   v1.18.14
ip-10-2-167-162.ec2.internal   Ready    node     2d    v1.13.1
ip-10-2-172-248.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-173-134.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-177-100.ec2.internal   Ready    master   2d    v1.13.1
ip-10-2-181-235.ec2.internal   Ready    node     47h   v1.13.1
ip-10-2-182-14.ec2.internal    Ready    node     47h   v1.13.1

What I am attempting to do

I am trying to update the cluster from v1.13.1 to v1.18.14.

I edited the config by running

$ kops edit cluster

and changed

kubernetesVersion: 1.18.14

then I ran

kops update cluster --yes
kops rolling-update cluster --yes

This started the rolling-update process:

NAME                STATUS        NEEDUPDATE    READY   MIN   TARGET   MAX   NODES
master-us-east-1a   NeedsUpdate   1             0       1     1        1     1
master-us-east-1b   NeedsUpdate   1             0       1     1        1     1
master-us-east-1c   NeedsUpdate   1             0       1     1        1     1
nodes               NeedsUpdate   8             0       8     8        8     8

The problem

The process gets stuck on the first master upgrade with this error:

I0108 10:48:40.137256   59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": master "ip-10-2-167-145.ec2.internal" is not ready, system-node-critical pod "calico-node-m255f" is not ready (calico-node).
I0108 10:49:12.474458   59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-node-critical pod "calico-node-m255f" is not ready (calico-node).

calico-node-m255f is the only calico-node pod in the cluster (I'm pretty sure there should be one for each k8s node?).
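
For reference, here is how one could confirm whether the DaemonSet is actually scheduling one pod per node (the -l k8s-app=calico-node selector comes from the DaemonSet labels shown further down; these are just standard kubectl commands, not output I captured at the time):

$ kubectl get daemonset calico-node -n kube-system          # DESIRED should match the node count
$ kubectl get nodes --no-headers | wc -l
$ kubectl get pods -n kube-system -l k8s-app=calico-node -o wide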

Info on that pod:

$ kubectl get pods -n kube-system -o wide | grep calico-node
calico-node-m255f                                            0/1     Running             0          35m   10.2.167.145      ip-10-2-167-145.ec2.internal   <none>           <none>

$ kubectl describe pod calico-node-m255f -n kube-system

Name:                 calico-node-m255f
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-2-167-145.ec2.internal/10.2.167.145
Start Time:           Fri, 08 Jan 2021 10:18:05 -0800
Labels:               controller-revision-hash=59875785d9
                      k8s-app=calico-node
                      pod-template-generation=5
                      role.kubernetes.io/networking=1
Annotations:          <none>
Status:               Running
IP:                   10.2.167.145
IPs:                  <none>
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://9a6d035ee4a9d881574f45075e033597a33118e1ed2c964204cc2a5b175fbc60
    Image:         calico/cni:v3.15.3
    Image ID:      docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Jan 2021 10:18:06 -0800
      Finished:     Fri, 08 Jan 2021 10:18:06 -0800
    Ready:          True
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
  install-cni:
    Container ID:  docker://5788e3519a2b1c1b77824dbfa090ad387e27d5bb16b751c3cf7637a7154ac576
    Image:         calico/cni:v3.15.3
    Image ID:      docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Jan 2021 10:18:07 -0800
      Finished:     Fri, 08 Jan 2021 10:18:08 -0800
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
  flexvol-driver:
    Container ID:   docker://bc8ad32a2dd0eb5bbb21843d4d248171bc117d2eede9e1efa9512026d9205888
    Image:          calico/pod2daemon-flexvol:v3.15.3
    Image ID:       docker-pullable://calico/pod2daemon-flexvol@sha256:cec7a31b08ab5f9b1ed14053b91fd08be83f58ddba0577e9dabd8b150a51233f
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 08 Jan 2021 10:18:08 -0800
      Finished:     Fri, 08 Jan 2021 10:18:08 -0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Containers:
  calico-node:
    Container ID:   docker://8911e4bdc0e60aa5f6c553c0e0d0e5f7aa981d62884141120d8f7cc5bc079884
    Image:          calico/node:v3.15.3
    Image ID:       docker-pullable://calico/node@sha256:1d674438fd05bd63162d9c7b732d51ed201ee7f6331458074e3639f4437e34b1
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 08 Jan 2021 10:18:09 -0800
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                         kubernetes
      WAIT_FOR_DATASTORE:                     true
      NODENAME:                                (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:              <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                           kops,bgp
      IP:                                     autodetect
      CALICO_IPV4POOL_IPIP:                   Always
      CALICO_IPV4POOL_VXLAN:                  Never
      FELIX_IPINIPMTU:                        <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_VXLANMTU:                         <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_WIREGUARDMTU:                     <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:                   100.96.0.0/11
      CALICO_DISABLE_FILE_LOGGING:            true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:      ACCEPT
      FELIX_IPV6SUPPORT:                      false
      FELIX_LOGSEVERITYSCREEN:                info
      FELIX_HEALTHENABLED:                    true
      FELIX_IPTABLESBACKEND:                  Auto
      FELIX_PROMETHEUSMETRICSENABLED:         false
      FELIX_PROMETHEUSMETRICSPORT:            9091
      FELIX_PROMETHEUSGOMETRICSENABLED:       true
      FELIX_PROMETHEUSPROCESSMETRICSENABLED:  true
      FELIX_WIREGUARDENABLED:                 false
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-mnnrd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-mnnrd
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  35m   default-scheduler  Successfully assigned kube-system/calico-node-m255f to ip-10-2-167-145.ec2.internal
  Normal   Pulled     35m   kubelet            Container image "calico/cni:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container upgrade-ipam
  Normal   Started    35m   kubelet            Started container upgrade-ipam
  Normal   Started    35m   kubelet            Started container install-cni
  Normal   Pulled     35m   kubelet            Container image "calico/cni:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container install-cni
  Normal   Pulled     35m   kubelet            Container image "calico/pod2daemon-flexvol:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container flexvol-driver
  Normal   Started    35m   kubelet            Started container flexvol-driver
  Normal   Started    35m   kubelet            Started container calico-node
  Normal   Pulled     35m   kubelet            Container image "calico/node:v3.15.3" already present on machine
  Normal   Created    35m   kubelet            Created container calico-node
  Warning  Unhealthy  35m   kubelet            Readiness probe failed: 2021-01-08 18:18:12.731 [INFO][130] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:22.727 [INFO][169] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:32.733 [INFO][207] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:42.730 [INFO][237] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  35m  kubelet  Readiness probe failed: 2021-01-08 18:18:52.736 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:02.731 [INFO][294] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:12.734 [INFO][318] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:22.739 [INFO][360] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  34m  kubelet  Readiness probe failed: 2021-01-08 18:19:32.748 [INFO][391] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
  Warning  Unhealthy  45s (x202 over 34m)  kubelet  (combined from similar events): Readiness probe failed: 2021-01-08 18:53:12.726 [INFO][6053] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14

I can SSH into the node and check Calico from there:

$ sudo ./calicoctl-linux-amd64 node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.2.147.44  | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.149.115 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.150.124 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.151.33  | node-to-node mesh | start | 00:21:18 | Active Socket: Connection      |
|              |                   |       |          | refused                        |
| 10.2.167.162 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.172.248 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.173.134 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.177.100 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.181.235 | node-to-node mesh | start | 00:21:18 | Passive                        |
| 10.2.182.14  | node-to-node mesh | start | 00:21:18 | Passive                        |
+--------------+-------------------+-------+----------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
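
Since several peers report "Connection refused", the next thing I would check is plain TCP reachability on the BGP port (179) from this node to one of those peers, for example (nc may not be installed on the kOps AMI, so treat this as illustrative):

$ sudo ss -tlnp | grep ':179'          # is BIRD listening locally?
$ nc -zvw3 10.2.147.44 179             # can this node reach BGP on a peer that refuses the connection?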

Here's the calico-node DaemonSet configuration (I assume this was generated by kOps and has been untouched):

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: calico-node
  namespace: kube-system
  selfLink: /apis/apps/v1/namespaces/kube-system/daemonsets/calico-node
  uid: 33dfb80a-c840-11e9-af87-02fc30bb40d6
  resourceVersion: '142850829'
  generation: 5
  creationTimestamp: '2019-08-26T20:29:28Z'
  labels:
    k8s-app: calico-node
    role.kubernetes.io/networking: '1'
  annotations:
    deprecated.daemonset.template.generation: '5'
    kubectl.kubernetes.io/last-applied-configuration: '[cut out to save space]'
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
        role.kubernetes.io/networking: '1'
    spec:
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
        - name: var-run-calico
          hostPath:
            path: /var/run/calico
            type: ''
        - name: var-lib-calico
          hostPath:
            path: /var/lib/calico
            type: ''
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
            type: ''
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
            type: ''
        - name: host-local-net-dir
          hostPath:
            path: /var/lib/cni/networks
            type: ''
        - name: policysync
          hostPath:
            path: /var/run/nodeagent
            type: DirectoryOrCreate
        - name: flexvol-driver-host
          hostPath:
            path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
            type: DirectoryOrCreate
      initContainers:
        - name: upgrade-ipam
          image: 'calico/cni:v3.15.3'
          command:
            - /opt/cni/bin/calico-ipam
            - '-upgrade'
          env:
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
          resources: {}
          volumeMounts:
            - name: host-local-net-dir
              mountPath: /var/lib/cni/networks
            - name: cni-bin-dir
              mountPath: /host/opt/cni/bin
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
        - name: install-cni
          image: 'calico/cni:v3.15.3'
          command:
            - /install-cni.sh
          env:
            - name: CNI_CONF_NAME
              value: 10-calico.conflist
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: cni_network_config
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CNI_MTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: SLEEP
              value: 'false'
          resources: {}
          volumeMounts:
            - name: cni-bin-dir
              mountPath: /host/opt/cni/bin
            - name: cni-net-dir
              mountPath: /host/etc/cni/net.d
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
        - name: flexvol-driver
          image: 'calico/pod2daemon-flexvol:v3.15.3'
          resources: {}
          volumeMounts:
            - name: flexvol-driver-host
              mountPath: /host/driver
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
      containers:
        - name: calico-node
          image: 'calico/node:v3.15.3'
          env:
            - name: DATASTORE_TYPE
              value: kubernetes
            - name: WAIT_FOR_DATASTORE
              value: 'true'
            - name: NODENAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CALICO_NETWORKING_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: calico_backend
            - name: CLUSTER_TYPE
              value: 'kops,bgp'
            - name: IP
              value: autodetect
            - name: CALICO_IPV4POOL_IPIP
              value: Always
            - name: CALICO_IPV4POOL_VXLAN
              value: Never
            - name: FELIX_IPINIPMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: FELIX_VXLANMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: FELIX_WIREGUARDMTU
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: veth_mtu
            - name: CALICO_IPV4POOL_CIDR
              value: 100.96.0.0/11
            - name: CALICO_DISABLE_FILE_LOGGING
              value: 'true'
            - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
              value: ACCEPT
            - name: FELIX_IPV6SUPPORT
              value: 'false'
            - name: FELIX_LOGSEVERITYSCREEN
              value: info
            - name: FELIX_HEALTHENABLED
              value: 'true'
            - name: FELIX_IPTABLESBACKEND
              value: Auto
            - name: FELIX_PROMETHEUSMETRICSENABLED
              value: 'false'
            - name: FELIX_PROMETHEUSMETRICSPORT
              value: '9091'
            - name: FELIX_PROMETHEUSGOMETRICSENABLED
              value: 'true'
            - name: FELIX_PROMETHEUSPROCESSMETRICSENABLED
              value: 'true'
            - name: FELIX_WIREGUARDENABLED
              value: 'false'
          resources:
            requests:
              cpu: 100m
          volumeMounts:
            - name: lib-modules
              readOnly: true
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
            - name: var-run-calico
              mountPath: /var/run/calico
            - name: var-lib-calico
              mountPath: /var/lib/calico
            - name: policysync
              mountPath: /var/run/nodeagent
          livenessProbe:
            exec:
              command:
                - /bin/calico-node
                - '-felix-live'
                - '-bird-live'
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
                - /bin/calico-node
                - '-felix-ready'
                - '-bird-ready'
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            procMount: Default
      restartPolicy: Always
      terminationGracePeriodSeconds: 0
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: calico-node
      serviceAccount: calico-node
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - operator: Exists
          effect: NoSchedule
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
          effect: NoExecute
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10
status:
  currentNumberScheduled: 1
  numberMisscheduled: 0
  desiredNumberScheduled: 1
  numberReady: 0
  observedGeneration: 5
  updatedNumberScheduled: 1
  numberUnavailable: 1
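
One detail that stands out in that status block: desiredNumberScheduled is 1 even though the cluster has 11 nodes, which matches the single calico-node pod above. A way to check whether the nodeSelector (kubernetes.io/os: linux) actually matches the labels on the older v1.13.1 nodes would be something like the following (that the labels differ between the old and new nodes is purely a guess on my part):

$ kubectl get nodes -L kubernetes.io/os -L beta.kubernetes.io/os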

There's nothing really useful in the pod logs either; no errors or anything obvious. It mostly looks like this:

2021-01-08 19:08:21.603 [INFO][48] felix/int_dataplane.go 1245: Applying dataplane updates
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 223: Asked to resync with the dataplane on next update. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 306: Resyncing ipsets with dataplane. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/wireguard.go 578: Wireguard is not enabled
2021-01-08 19:08:21.605 [INFO][48] felix/ipsets.go 356: Finished resync family="inet" numInconsistenciesFound=0 resyncDuration=1.573324ms
2021-01-08 19:08:21.605 [INFO][48] felix/int_dataplane.go 1259: Finished applying updates to dataplane. msecToApply=2.03915
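
For completeness, this is how I would pull anything BGP/BIRD-related out of the container logs instead of the felix noise (an illustrative grep; I have not pasted that output here):

$ kubectl logs calico-node-m255f -n kube-system -c calico-node --tail=2000 | grep -iE 'bird|bgp'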

Things I tried

Unfortunately, I'm not a networking expert, so I didn't get too deep into the specifics of Calico.

I have tried restarting the related pods, rebooting the actual EC2 instance, and deleting the DaemonSet and re-applying it using the above config.

I can also assure you there are no networking restrictions (firewalls, sec groups, etc) within the internal network that might be blocking connections.

It's also worth pointing out that this cluster was working perfectly before the kops rolling-update attempt.

I'm pretty much at a roadblock here and not sure what else I could try.


2 Answers

1 vote

I solved this by updating all the masters at the same time, without validation:

kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s --node-interval=1s

Everything is working now!
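
Since --cloudonly skips cluster validation entirely, it's worth validating manually once the masters come back up (standard kOps/kubectl commands, nothing specific to this cluster):

$ kops validate cluster
$ kubectl get nodes -o wide
$ kubectl get pods -n kube-system -l k8s-app=calico-node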

1 vote

I don't have a definitive answer as to why this happened, but by jumping from k8s 1.13 directly to 1.18 you skipped a number of incremental changes, and that may have caused the issues you are seeing.

While it is always safe to use the latest kOps version (as long as it supports the k8s version you are using), k8s itself only supports upgrading one minor version at a time: https://kubernetes.io/docs/setup/release/version-skew-policy/
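
In practice that means repeating the same kOps upgrade loop once per minor version (1.13 -> 1.14 -> 1.15 -> 1.16 -> 1.17 -> 1.18) rather than in one jump, validating in between. Roughly (the exact patch versions are up to you; these steps mirror the ones in the question):

$ kops edit cluster                  # bump kubernetesVersion by one minor, e.g. 1.14.x
$ kops update cluster --yes
$ kops rolling-update cluster --yes
$ kops validate cluster              # confirm the cluster is healthy before the next hop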