Let me preface this by saying that this is running on a production cluster, so any 'destructive' solution that would cause downtime is not an option (unless absolutely necessary).
My environment
I have a Kubernetes cluster (11 nodes, 3 of which are master nodes) running v1.13.1 on AWS. This cluster was created via kOps like so:
kops create cluster \
--yes \
--authorization RBAC \
--cloud aws \
--networking calico \
...
I don't think this is relevant, but everything on the cluster has been installed via helm3.
Here are my exact versions:
$ helm version
version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.5"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:31:33Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
$ kops version
Version 1.18.2
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-2-147-44.ec2.internal Ready node 47h v1.13.1
ip-10-2-149-115.ec2.internal Ready node 47h v1.13.1
ip-10-2-150-124.ec2.internal Ready master 2d v1.13.1
ip-10-2-151-33.ec2.internal Ready node 47h v1.13.1
ip-10-2-167-145.ec2.internal Ready master 43h v1.18.14
ip-10-2-167-162.ec2.internal Ready node 2d v1.13.1
ip-10-2-172-248.ec2.internal Ready node 47h v1.13.1
ip-10-2-173-134.ec2.internal Ready node 47h v1.13.1
ip-10-2-177-100.ec2.internal Ready master 2d v1.13.1
ip-10-2-181-235.ec2.internal Ready node 47h v1.13.1
ip-10-2-182-14.ec2.internal Ready node 47h v1.13.1
What I am attempting to do
I am trying to upgrade the cluster from v1.13.1 to v1.18.14.
I edited the cluster config with
$ kops edit cluster
and changed
kubernetesVersion: 1.18.14
then I ran
kops update cluster --yes
kops rolling-update cluster --yes
which then started the rolling-update process:
NAME STATUS NEEDUPDATE READY MIN TARGET MAX NODES
master-us-east-1a NeedsUpdate 1 0 1 1 1 1
master-us-east-1b NeedsUpdate 1 0 1 1 1 1
master-us-east-1c NeedsUpdate 1 0 1 1 1 1
nodes NeedsUpdate 8 0 8 8 8 8
The problem:
The process gets stuck on the first node's upgrade with this error:
I0108 10:48:40.137256 59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": master "ip-10-2-167-145.ec2.internal" is not ready, system-node-critical pod "calico-node-m255f" is not ready (calico-node).
I0108 10:49:12.474458 59317 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-node-critical pod "calico-node-m255f" is not ready (calico-node).
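As far as I understand, this is the same check that kops runs between node replacements and that can also be run by hand:
$ kops validate cluster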
calico-node-m255f is the only calico-node pod in the cluster (I'm pretty sure there should be one per Kubernetes node, since it's a DaemonSet?).
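To sanity-check that, the DaemonSet's scheduling counts can be compared against the node count with a generic query like:
$ kubectl get daemonset calico-node -n kube-system -o wide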
Info on that pod:
$ kubectl get pods -n kube-system -o wide | grep calico-node
calico-node-m255f 0/1 Running 0 35m 10.2.167.145 ip-10-2-167-145.ec2.internal <none> <none>
$ kubectl describe pod calico-node-m255f -n kube-system
Name: calico-node-m255f
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: ip-10-2-167-145.ec2.internal/10.2.167.145
Start Time: Fri, 08 Jan 2021 10:18:05 -0800
Labels: controller-revision-hash=59875785d9
k8s-app=calico-node
pod-template-generation=5
role.kubernetes.io/networking=1
Annotations: <none>
Status: Running
IP: 10.2.167.145
IPs: <none>
Controlled By: DaemonSet/calico-node
Init Containers:
upgrade-ipam:
Container ID: docker://9a6d035ee4a9d881574f45075e033597a33118e1ed2c964204cc2a5b175fbc60
Image: calico/cni:v3.15.3
Image ID: docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
Port: <none>
Host Port: <none>
Command:
/opt/cni/bin/calico-ipam
-upgrade
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Jan 2021 10:18:06 -0800
Finished: Fri, 08 Jan 2021 10:18:06 -0800
Ready: True
Restart Count: 0
Environment:
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/var/lib/cni/networks from host-local-net-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
install-cni:
Container ID: docker://5788e3519a2b1c1b77824dbfa090ad387e27d5bb16b751c3cf7637a7154ac576
Image: calico/cni:v3.15.3
Image ID: docker-pullable://calico/cni@sha256:519e5c74c3c801ee337ca49b95b47153e01fd02b7d2797c601aeda48dc6367ff
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Jan 2021 10:18:07 -0800
Finished: Fri, 08 Jan 2021 10:18:08 -0800
Ready: True
Restart Count: 0
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CNI_MTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
SLEEP: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
flexvol-driver:
Container ID: docker://bc8ad32a2dd0eb5bbb21843d4d248171bc117d2eede9e1efa9512026d9205888
Image: calico/pod2daemon-flexvol:v3.15.3
Image ID: docker-pullable://calico/pod2daemon-flexvol@sha256:cec7a31b08ab5f9b1ed14053b91fd08be83f58ddba0577e9dabd8b150a51233f
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 08 Jan 2021 10:18:08 -0800
Finished: Fri, 08 Jan 2021 10:18:08 -0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Containers:
calico-node:
Container ID: docker://8911e4bdc0e60aa5f6c553c0e0d0e5f7aa981d62884141120d8f7cc5bc079884
Image: calico/node:v3.15.3
Image ID: docker-pullable://calico/node@sha256:1d674438fd05bd63162d9c7b732d51ed201ee7f6331458074e3639f4437e34b1
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 08 Jan 2021 10:18:09 -0800
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Liveness: exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
NODENAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: kops,bgp
IP: autodetect
CALICO_IPV4POOL_IPIP: Always
CALICO_IPV4POOL_VXLAN: Never
FELIX_IPINIPMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
FELIX_VXLANMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
FELIX_WIREGUARDMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
CALICO_IPV4POOL_CIDR: 100.96.0.0/11
CALICO_DISABLE_FILE_LOGGING: true
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_IPV6SUPPORT: false
FELIX_LOGSEVERITYSCREEN: info
FELIX_HEALTHENABLED: true
FELIX_IPTABLESBACKEND: Auto
FELIX_PROMETHEUSMETRICSENABLED: false
FELIX_PROMETHEUSMETRICSPORT: 9091
FELIX_PROMETHEUSGOMETRICSENABLED: true
FELIX_PROMETHEUSPROCESSMETRICSENABLED: true
FELIX_WIREGUARDENABLED: false
Mounts:
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-mnnrd (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
host-local-net-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/cni/networks
HostPathType:
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
HostPathType: DirectoryOrCreate
calico-node-token-mnnrd:
Type: Secret (a volume populated by a Secret)
SecretName: calico-node-token-mnnrd
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 35m default-scheduler Successfully assigned kube-system/calico-node-m255f to ip-10-2-167-145.ec2.internal
Normal Pulled 35m kubelet Container image "calico/cni:v3.15.3" already present on machine
Normal Created 35m kubelet Created container upgrade-ipam
Normal Started 35m kubelet Started container upgrade-ipam
Normal Started 35m kubelet Started container install-cni
Normal Pulled 35m kubelet Container image "calico/cni:v3.15.3" already present on machine
Normal Created 35m kubelet Created container install-cni
Normal Pulled 35m kubelet Container image "calico/pod2daemon-flexvol:v3.15.3" already present on machine
Normal Created 35m kubelet Created container flexvol-driver
Normal Started 35m kubelet Started container flexvol-driver
Normal Started 35m kubelet Started container calico-node
Normal Pulled 35m kubelet Container image "calico/node:v3.15.3" already present on machine
Normal Created 35m kubelet Created container calico-node
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:12.731 [INFO][130] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:22.727 [INFO][169] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:32.733 [INFO][207] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:42.730 [INFO][237] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 35m kubelet Readiness probe failed: 2021-01-08 18:18:52.736 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:02.731 [INFO][294] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:12.734 [INFO][318] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:22.739 [INFO][360] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 34m kubelet Readiness probe failed: 2021-01-08 18:19:32.748 [INFO][391] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
Warning Unhealthy 45s (x202 over 34m) kubelet (combined from similar events): Readiness probe failed: 2021-01-08 18:53:12.726 [INFO][6053] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.2.147.44,10.2.149.115,10.2.150.124,10.2.151.33,10.2.167.162,10.2.172.248,10.2.173.134,10.2.177.100,10.2.181.235,10.2.182.14
I can SSH into the node and check Calico from there:
$ sudo ./calicoctl-linux-amd64 node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.2.147.44 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.149.115 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.150.124 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.151.33 | node-to-node mesh | start | 00:21:18 | Active Socket: Connection |
| | | | | refused |
| 10.2.167.162 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.172.248 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.173.134 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.177.100 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.181.235 | node-to-node mesh | start | 00:21:18 | Passive |
| 10.2.182.14 | node-to-node mesh | start | 00:21:18 | Passive |
+--------------+-------------------+-------+----------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
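For reference, the node-to-node mesh peers over BGP on TCP port 179, so a raw connectivity check from this master toward one of the peers listed above would look something like this (assuming nc is available on the host):
$ nc -vz 10.2.147.44 179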
Here's the calico-node DaemonSet configuration (I assume this was generated by kOps and has been left untouched):
kind: DaemonSet
apiVersion: apps/v1
metadata:
name: calico-node
namespace: kube-system
selfLink: /apis/apps/v1/namespaces/kube-system/daemonsets/calico-node
uid: 33dfb80a-c840-11e9-af87-02fc30bb40d6
resourceVersion: '142850829'
generation: 5
creationTimestamp: '2019-08-26T20:29:28Z'
labels:
k8s-app: calico-node
role.kubernetes.io/networking: '1'
annotations:
deprecated.daemonset.template.generation: '5'
kubectl.kubernetes.io/last-applied-configuration: '[cut out to save space]'
spec:
selector:
matchLabels:
k8s-app: calico-node
template:
metadata:
creationTimestamp: null
labels:
k8s-app: calico-node
role.kubernetes.io/networking: '1'
spec:
volumes:
- name: lib-modules
hostPath:
path: /lib/modules
type: ''
- name: var-run-calico
hostPath:
path: /var/run/calico
type: ''
- name: var-lib-calico
hostPath:
path: /var/lib/calico
type: ''
- name: xtables-lock
hostPath:
path: /run/xtables.lock
type: FileOrCreate
- name: cni-bin-dir
hostPath:
path: /opt/cni/bin
type: ''
- name: cni-net-dir
hostPath:
path: /etc/cni/net.d
type: ''
- name: host-local-net-dir
hostPath:
path: /var/lib/cni/networks
type: ''
- name: policysync
hostPath:
path: /var/run/nodeagent
type: DirectoryOrCreate
- name: flexvol-driver-host
hostPath:
path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
type: DirectoryOrCreate
initContainers:
- name: upgrade-ipam
image: 'calico/cni:v3.15.3'
command:
- /opt/cni/bin/calico-ipam
- '-upgrade'
env:
- name: KUBERNETES_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CALICO_NETWORKING_BACKEND
valueFrom:
configMapKeyRef:
name: calico-config
key: calico_backend
resources: {}
volumeMounts:
- name: host-local-net-dir
mountPath: /var/lib/cni/networks
- name: cni-bin-dir
mountPath: /host/opt/cni/bin
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
- name: install-cni
image: 'calico/cni:v3.15.3'
command:
- /install-cni.sh
env:
- name: CNI_CONF_NAME
value: 10-calico.conflist
- name: CNI_NETWORK_CONFIG
valueFrom:
configMapKeyRef:
name: calico-config
key: cni_network_config
- name: KUBERNETES_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CNI_MTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: SLEEP
value: 'false'
resources: {}
volumeMounts:
- name: cni-bin-dir
mountPath: /host/opt/cni/bin
- name: cni-net-dir
mountPath: /host/etc/cni/net.d
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
- name: flexvol-driver
image: 'calico/pod2daemon-flexvol:v3.15.3'
resources: {}
volumeMounts:
- name: flexvol-driver-host
mountPath: /host/driver
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
containers:
- name: calico-node
image: 'calico/node:v3.15.3'
env:
- name: DATASTORE_TYPE
value: kubernetes
- name: WAIT_FOR_DATASTORE
value: 'true'
- name: NODENAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CALICO_NETWORKING_BACKEND
valueFrom:
configMapKeyRef:
name: calico-config
key: calico_backend
- name: CLUSTER_TYPE
value: 'kops,bgp'
- name: IP
value: autodetect
- name: CALICO_IPV4POOL_IPIP
value: Always
- name: CALICO_IPV4POOL_VXLAN
value: Never
- name: FELIX_IPINIPMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: FELIX_VXLANMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: FELIX_WIREGUARDMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
- name: CALICO_IPV4POOL_CIDR
value: 100.96.0.0/11
- name: CALICO_DISABLE_FILE_LOGGING
value: 'true'
- name: FELIX_DEFAULTENDPOINTTOHOSTACTION
value: ACCEPT
- name: FELIX_IPV6SUPPORT
value: 'false'
- name: FELIX_LOGSEVERITYSCREEN
value: info
- name: FELIX_HEALTHENABLED
value: 'true'
- name: FELIX_IPTABLESBACKEND
value: Auto
- name: FELIX_PROMETHEUSMETRICSENABLED
value: 'false'
- name: FELIX_PROMETHEUSMETRICSPORT
value: '9091'
- name: FELIX_PROMETHEUSGOMETRICSENABLED
value: 'true'
- name: FELIX_PROMETHEUSPROCESSMETRICSENABLED
value: 'true'
- name: FELIX_WIREGUARDENABLED
value: 'false'
resources:
requests:
cpu: 100m
volumeMounts:
- name: lib-modules
readOnly: true
mountPath: /lib/modules
- name: xtables-lock
mountPath: /run/xtables.lock
- name: var-run-calico
mountPath: /var/run/calico
- name: var-lib-calico
mountPath: /var/lib/calico
- name: policysync
mountPath: /var/run/nodeagent
livenessProbe:
exec:
command:
- /bin/calico-node
- '-felix-live'
- '-bird-live'
initialDelaySeconds: 10
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 6
readinessProbe:
exec:
command:
- /bin/calico-node
- '-felix-ready'
- '-bird-ready'
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
procMount: Default
restartPolicy: Always
terminationGracePeriodSeconds: 0
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: calico-node
serviceAccount: calico-node
hostNetwork: true
securityContext: {}
schedulerName: default-scheduler
tolerations:
- operator: Exists
effect: NoSchedule
- key: CriticalAddonsOnly
operator: Exists
- operator: Exists
effect: NoExecute
priorityClassName: system-node-critical
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
revisionHistoryLimit: 10
status:
currentNumberScheduled: 1
numberMisscheduled: 0
desiredNumberScheduled: 1
numberReady: 0
observedGeneration: 5
updatedNumberScheduled: 1
numberUnavailable: 1
There's nothing really useful in the pod logs either; no errors or anything obvious. It mostly looks like this:
2021-01-08 19:08:21.603 [INFO][48] felix/int_dataplane.go 1245: Applying dataplane updates
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 223: Asked to resync with the dataplane on next update. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/ipsets.go 306: Resyncing ipsets with dataplane. family="inet"
2021-01-08 19:08:21.603 [INFO][48] felix/wireguard.go 578: Wireguard is not enabled
2021-01-08 19:08:21.605 [INFO][48] felix/ipsets.go 356: Finished resync family="inet" numInconsistenciesFound=0 resyncDuration=1.573324ms
2021-01-08 19:08:21.605 [INFO][48] felix/int_dataplane.go 1259: Finished applying updates to dataplane. msecToApply=2.03915
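If more detail would help, the BIRD/BGP-specific lines can be pulled out of the calico-node container logs with something like:
$ kubectl logs calico-node-m255f -n kube-system -c calico-node | grep -iE 'bird|bgp'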
Things I tried
Unfortunately, I'm not a networking expert, so I haven't gotten too deep into the specifics of Calico.
I have tried restarting the related pods, rebooting the actual EC2 instance, and deleting the DaemonSet and re-adding it using the config above.
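Concretely, the pod restart and the DaemonSet delete/re-add were along these lines (the exact file name is just illustrative):
$ kubectl delete pod calico-node-m255f -n kube-system       # let the DaemonSet recreate it
$ kubectl delete daemonset calico-node -n kube-system
$ kubectl apply -f calico-node-daemonset.yaml               # the config shown above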
I can also assure you there are no networking restrictions (firewalls, security groups, etc.) within the internal network that might be blocking connections.
It's also worth pointing out that this cluster was working perfectly before the kops rolling-update attempt.
I'm pretty much at a roadblock here and not sure what else I could try.