Sincere apologies for this lengthy posting.
I have a 4-node Kubernetes cluster with 1 master and 3 worker nodes. I connect to the cluster using a kubeconfig file, but since yesterday I have not been able to connect.
Running kubectl get pods gave the error "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"
In the kubeconfig, the server is specified as https://api.xxxxx.xxxxxxxx.com
Note: Because the question contained too many https links, I was not able to post it. I have therefore replaced https:// with https:-- in the background analysis section so that the URLs are not treated as links.
I tried to run kubectl from the master node and received a similar error:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Then I checked the kube-apiserver docker container and it was continuously exiting (CrashLoopBackOff).
docker logs <container-id of kube-apiserver> shows the errors below:
W0914 16:29:25.761524 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
F0914 16:29:29.319785 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (context deadline exceeded)
systemctl status kubelet was giving the errors below:
Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576 2411 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
Note: ip-xxx-xx-xx-xxx --> internal IP address of aws ec2 instance.
Background Analysis:
It looks like there was an issue with the cluster on 7th Sep 2020: both the kube-controller and kube-scheduler docker containers exited and restarted. I believe that either kube-apiserver has not been running since then, or kube-apiserver was the reason those containers restarted. The kube-apiserver server certificate expired in July 2020, but access via kubectl was working until 7th Sep.
Below are the docker logs from the exited kube-scheduler docker container:
I0907 10:35:08.970384 1 scheduler.go:572] pod default/k8version-1599474900-hrjcn is bound successfully on node ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:40:09.286831 1 scheduler.go:572] pod default/k8version-1599475200-tshlx is bound successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:44:01.935373 1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.935420 1 server.go:252] lost master lost lease
Below are the docker logs from the exited kube-controller docker container:
I0907 10:40:19.703485 1 garbagecollector.go:518] delete object [v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid: 67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy Background
I0907 10:44:01.937398 1 leaderelection.go:263] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.937506 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0907 10:44:01.937456 1 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1", ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720 stopped leading
F0907 10:44:01.937545 1 controllermanager.go:260] leaderelection lost
I0907 10:44:01.949274 1 range_allocator.go:169] Shutting down range CIDR allocator
I0907 10:44:01.949285 1 replica_set.go:194] Shutting down replicaset controller
I0907 10:44:01.949291 1 gc_controller.go:86] Shutting down GC controller
I0907 10:44:01.949304 1 pvc_protection_controller.go:111] Shutting down PVC protection controller
I0907 10:44:01.949310 1 route_controller.go:125] Shutting down route controller
I0907 10:44:01.949316 1 service_controller.go:197] Shutting down service controller
I0907 10:44:01.949327 1 deployment_controller.go:164] Shutting down deployment controller
I0907 10:44:01.949435 1 garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443 1 resource_quota_controller.go:295] Shutting down resource quota controller
Below are the docker logs from kube-controller since the restart (7th Sep):
E0915 21:51:36.028108 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:51:40.133446 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
Below are the docker logs from kube-scheduler since the restart (7th Sep):
E0915 21:52:44.703587 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.704504 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: Get https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.705471 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.706477 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.707581 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.708599 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.709687 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.710744 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.711879 1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list *v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.712903 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
kube-apiserver certificate Renewal:
I found that the kube-apiserver certificate, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt, had expired in July 2020. There were a few other expired certificates related to etcd-manager-main and etcd-manager-events (the same copies of the certificates are in both places), but I don't see them referenced in the manifest files.
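For reference, expiry dates like these can be listed with openssl. The following is only a minimal sketch: the function name list_cert_expiry and the directory argument are illustrative and not part of kops or Kubernetes, and it assumes the certificates are PEM files ending in .crt.

```shell
#!/bin/sh
# Print state, expiry date and path for every *.crt file below a directory.
# list_cert_expiry is a hypothetical helper name; the directory is an assumption.
list_cert_expiry() {
    dir="$1"
    find "$dir" -type f -name '*.crt' | sort | while read -r crt; do
        # -enddate prints "notAfter=<date>"; keep only the date part
        end=$(openssl x509 -in "$crt" -noout -enddate | cut -d= -f2)
        # -checkend 0 exits non-zero when the certificate has already expired
        if openssl x509 -in "$crt" -noout -checkend 0 >/dev/null 2>&1; then
            state="valid"
        else
            state="EXPIRED"
        fi
        printf '%s\t%s\t%s\n' "$state" "$end" "$crt"
    done
}
```

It could be called as list_cert_expiry /etc/kubernetes/pki, and again on any other directory that holds certificates.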
I searched and found steps to renew the certificates, but most of them used "kubeadm init phase" commands. I couldn't find kubeadm on the master server, and the certificate names and paths were different from my setup. So I generated a new certificate for kube-apiserver with openssl, using the existing CA certificate and an openssl.cnf file that included the DNS names, the internal and external IP addresses of the EC2 instance, and the loopback IP address. I replaced the old certificate with the new one under the same name, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
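To make the openssl step concrete, here is a rough sketch of reissuing a certificate for an existing private key with an existing CA. It is not the exact procedure I ran. The function name reissue_cert, the 365-day validity, and every path and subject passed to it are assumptions; it also assumes the CA private key is available and does not add any subject alternative names (those would need an extension file).

```shell
#!/bin/sh
# Reissue a certificate for an EXISTING private key, signed by an existing CA.
# reissue_cert is a hypothetical helper; all arguments are placeholders.
reissue_cert() {
    key="$1"      # existing private key, left untouched
    ca_crt="$2"   # CA certificate
    ca_key="$3"   # CA private key (assumed to be available)
    subj="$4"     # subject, e.g. copied from the old certificate
    out="$5"      # where the new certificate is written
    work=$(mktemp -d) || return 1

    # Build the CSR from the existing key so the new certificate matches it,
    # then sign it with the CA; the serial file stays in the temp directory.
    openssl req -new -key "$key" -subj "$subj" -out "$work/req.csr" &&
    openssl x509 -req -in "$work/req.csr" -CA "$ca_crt" -CAkey "$ca_key" \
        -CAserial "$work/ca.srl" -CAcreateserial \
        -days 365 -sha256 -out "$out"
    rc=$?
    rm -rf "$work"
    return "$rc"
}
```

The subject of the old certificate can be read with openssl x509 -in <old certificate> -noout -subject before it is replaced, so that the new one carries the same identity.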
After that, I restarted the kube-apiserver docker container (which was continuously exiting) and restarted kubelet. The certificate expiry message no longer appears, but kube-apiserver is still restarting continuously, which I believe is the reason for the errors in the kube-controller and kube-scheduler docker containers.
NOTE: I have not restarted docker on the master server after replacing the certificate.
NOTE: All our production pods are running on worker nodes, so they are not affected, but I can't manage them because I can't connect using kubectl.
Now I am not sure what the issue is and why kube-apiserver is restarting continuously.
Update to the original question:
Kubernetes version: v1.14.1
Docker version: 18.6.3
Below are the latest docker logs from the kube-apiserver container (which is still crashing):
F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)
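The "private key does not match public key" error suggests that etcd-client.crt and etcd-client.key no longer belong together. A minimal sketch of how that could be checked with openssl is below; cert_matches_key is a hypothetical helper name and the paths in the usage line are simply the ones from the log above.

```shell
#!/bin/sh
# Return 0 if the certificate and the private key share the same public key,
# 1 if they differ, 2 if either file cannot be read by openssl.
# cert_matches_key is a hypothetical helper name.
cert_matches_key() {
    crt="$1"
    key="$2"
    # Public key embedded in the certificate, as PEM
    crt_pub=$(openssl x509 -in "$crt" -noout -pubkey 2>/dev/null) || return 2
    # Public key derived from the private key, as PEM
    key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 2
    [ "$crt_pub" = "$key_pub" ]
}
```

For example: cert_matches_key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-client.key && echo match || echo mismatch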
Below is the output from systemctl status kubelet
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https:--127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
This cluster (along with 3 others) was set up using kops. The other clusters are running normally, although it looks like they have some expired certificates as well. The person who set up the clusters is not available for comment and I have limited experience with Kubernetes, hence this request for assistance from the gurus.
Any help is very much appreciated.
Many thanks.
Update after response from Zambozo and Nepomucen:
Thanks to both of you for your responses. Based on them, I found that there were expired etcd certificates on the /mnt mount point.
I followed the workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/
and recreated the etcd certificates and keys. I have compared each new certificate with a copy of the old one (from my backup folder) and everything matches; the new certificates have an expiry date of Sep 2021.
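A comparison of this kind can be sketched as follows. This is only an illustration: same_identity is a hypothetical helper name, and it compares only the subject and issuer of the two certificates, not extensions or keys.

```shell
#!/bin/sh
# Return 0 if two certificates have the same subject and issuer,
# 1 if they differ, 2 if either file cannot be read by openssl.
# same_identity is a hypothetical helper name.
same_identity() {
    old="$1"   # certificate from the backup folder
    new="$2"   # regenerated certificate
    old_id=$(openssl x509 -in "$old" -noout -subject -issuer 2>/dev/null) || return 2
    new_id=$(openssl x509 -in "$new" -noout -subject -issuer 2>/dev/null) || return 2
    [ "$old_id" = "$new_id" ]
}
```

The new expiry date can be read separately with openssl x509 -in <new certificate> -noout -enddate.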
Now I am getting a different error in the etcd docker containers (both etcd-manager-events and etcd-manager-main).
Note: xxx-xx-xx-xxx is the IP address of the master server.
root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857 8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878 8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887 8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763 8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801 8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314 8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272 8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653 8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724 8221 hosts.go:181] skipping update of unchanged /etc/hosts
root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654 8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692 8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923 8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548 8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119 8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
Could you please suggest how to proceed from here?
Many thanks.