
I have a Kubernetes cluster set up using kubeadm. I installed Prometheus and node-exporter on top of it based on:

The pods seem to be running properly:

 kubectl get pods --namespace=monitoring -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
node-exporter-jk2sd                      1/1     Running   0          90m   192.168.5.20   work03   <none>           <none>
node-exporter-jldrx                      1/1     Running   0          90m   192.168.5.17   work04   <none>           <none>
node-exporter-mgtld                      1/1     Running   0          90m   192.168.5.15   work01   <none>           <none>
node-exporter-tq7bx                      1/1     Running   0          90m   192.168.5.41   work02   <none>           <none>
prometheus-deployment-5d79b5f65b-tkpd2   1/1     Running   0          91m   192.168.5.40   work02   <none>           <none>

I can see the endpoints, as well:

kubectl get endpoints -n monitoring
NAME            ENDPOINTS                                                           AGE
node-exporter   192.168.5.15:9100,192.168.5.17:9100,192.168.5.20:9100 + 1 more...   5m3s

I also ran kubectl port-forward prometheus-deployment-5d79b5f65b-tkpd2 8080:9090 -n monitoring, but when I open the Prometheus web UI > Status > Targets, the node-exporters are not listed there. When I start typing a query for a metric reported by node-exporter, it doesn't show up in the query editor's autocompletion either.
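
For completeness, the same list of targets can be inspected through the Prometheus HTTP API while the port-forward is running (a quick sketch; it assumes jq is available locally, otherwise read the raw JSON):

# List the job label of every active scrape target Prometheus currently knows about
curl -s http://localhost:8080/api/v1/targets | jq '.data.activeTargets[].labels.job'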

The logs from the Prometheus pod contain a lot of errors:

kubectl logs prometheus-deployment-5d79b5f65b-tkpd2 -n monitoring
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:428 msg="Starting Prometheus" version="(version=2.29.1, branch=HEAD, revision=dcb07e8eac34b5ea37cd229545000b857f1c1637)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:433 build_context="(go=go1.16.7, user=root@364730518a4e, date=20210811-14:48:27)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:434 host_details="(Linux 5.4.0-70-generic #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021 x86_64 prometheus-deployment-5d79b5f65b-tkpd2 (none))"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:435 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:436 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-08-11T16:24:21.745Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-08-11T16:24:21.745Z caller=main.go:812 msg="Starting TSDB ..."
level=info ts=2021-08-11T16:24:21.748Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:815 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:829 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.15µs
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:835 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-08-11T16:24:21.754Z caller=head.go:892 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2021-08-11T16:24:21.754Z caller=head.go:898 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=75.316µs wal_replay_duration=451.769µs total_replay_duration=566.051µs
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:839 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:842 msg="TSDB started"
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:969 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-08-11T16:24:21.757Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.759Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.762Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.764Z caller=main.go:1006 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=7.940972ms db_storage=607ns remote_storage=1.251µs web_handler=283ns query_engine=694ns scrape=227.668µs scrape_sd=6.081132ms notify=27.11µs notify_sd=16.477µs rules=648.58µs
level=info ts=2021-08-11T16:24:21.764Z caller=main.go:784 msg="Server is ready to receive web requests."
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.766Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:24:51.766Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:22.587Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:22.855Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:23.153Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:23.261Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:23.335Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:54.814Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:55.282Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:55.516Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:55.934Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:25:56.442Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.058Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.204Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.246Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:30.879Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:26:31.479Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:09.673Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:09.835Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:10.467Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:11.170Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:12.684Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:27:55.324Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:01.550Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:01.621Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:04.801Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:05.598Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https://10.96.0.1:443/api/v1/nodes?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:28:57.256Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"
level=error ts=2021-08-11T16:29:04.688Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://10.96.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 10.96.0.1:443: i/o timeout"

Is there a way to solve this issue and make node-exporters show up in the targets?

Version details:

kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}

Edit: The cluster was set up as follows:

sudo kubeadm reset
sudo rm $HOME/.kube/config
sudo kubeadm init --pod-network-cidr=192.168.5.0/24
mkdir -p $HOME/.kube; sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config; sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

It is using flannel.

flannel pods are running:

kube-flannel-ds-45qwf                1/1     Running   0          31h   x.x.x.41   work01   <none>           <none>
kube-flannel-ds-4rwzj                1/1     Running   0          31h   x.x.x.40   mast01   <none>           <none>
kube-flannel-ds-8fdtt                1/1     Running   24         31h   x.x.x.43   work03   <none>           <none>
kube-flannel-ds-8hl5f                1/1     Running   23         31h   x.x.x.44   work04   <none>           <none>
kube-flannel-ds-xqtrd                1/1     Running   0          31h   x.x.x.42   work02   <none>           <none>
At first glance, those errors (the i/o timeouts especially) would suggest that your SDN is not working properly. It could be limited to the node hosting your Prometheus Pod, or it could affect other nodes in your cluster. Prometheus can't query your Kubernetes API: no service/pod/... can be discovered. Could you tell us more about your cluster? Have you followed some how-to, blog post, ... using kubeadm? What SDN did you set up? Are you sure it's working properly? – SYN
Yes, I am also worried it might be a networking issue. I did not set the cluster up myself, but I edited the post to include details on how it was set up (based on the command history I found). Is there any command I can run to confirm that this is a networking issue? – sqlquestionasker
As a test, you can open a shell on any worker node and try to curl the API (10.96.0.1:443). If it works, check the other nodes as well, ... If that doesn't work, you may be missing a route (share route -n). Otherwise, try something similar from a Pod running on your worker nodes (without hostNetwork / it must be within the SDN). If you can't reach the API, the issue could be with iptables (iptables -nL) or ipvs (ipvsadm -ln), maybe kube-proxy, or still flannel (check kubectl logs), ... If you find a node that works, compare the iptables/ipvs configuration. – SYN
Oh... And ... --pod-network-cidr=192.168.5.0/24 sounds wrong. I think the default host subnet length is 24 as well: whenever a new node joins the cluster, a portion of your cluster pod network CIDR is allocated to it. If your whole pod subnet is a /24, I suspect only your master had its pod subnet properly allocated; you may already be out of addresses for the others... Check kubectl get nodes -o yaml. With flannel, you should find a spec.podCIDR and/or spec.podCIDRs array. Make sure all your nodes have their own subnet within your cluster pod network. – SYN
Also ... if you've installed the flannel configuration from their releases without editing it, ... then you should have deployed your cluster with --pod-network-cidr=10.244.0.0/16. See github.com/flannel-io/flannel/issues/1054 – SYN
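
A minimal sketch of the connectivity test suggested in the comments above. The pod name and the public curlimages/curl image are placeholders; any HTTP response (even 401 Unauthorized) proves connectivity, while an i/o timeout reproduces the problem:

# From a worker node (host network):
curl -k -m 10 https://10.96.0.1:443/version

# From inside the SDN, using a hypothetical one-off pod:
kubectl run net-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -k -m 10 https://10.96.0.1:443/version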

1 Answer


The issue is related to the SDN not working properly.

As a general rule when troubleshooting this, we would check the SDN pods (Calico, Weave, or in this case flannel): are they healthy, are there any errors in their logs, ...
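
For example, with the default upstream kube-flannel.yml manifest the DaemonSet pods run in kube-system with the label app=flannel and a container named kube-flannel (adjust these if the manifest was customised):

kubectl -n kube-system get pods -l app=flannel -o wide
kubectl -n kube-system logs -l app=flannel -c kube-flannel --tail=50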

Check the iptables (iptables -nL) and ipvs (ipvsadm -ln) configuration on the nodes.
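
Concretely, on each node, a quick check that kube-proxy has programmed rules for the kubernetes Service VIP (10.96.0.1) could look like this, depending on the proxy mode:

# iptables mode: the KUBE-SERVICES chain is created by kube-proxy
sudo iptables -t nat -nL KUBE-SERVICES | grep 10.96.0.1
# ipvs mode: the VIP should appear as a virtual service with the API server(s) as backends
sudo ipvsadm -ln | grep -A2 10.96.0.1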

If you still haven't found anything, restart the SDN pods as well as kube-proxy.
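
A sketch of that restart, assuming the default labels from the kubeadm and upstream flannel manifests (verify with kubectl -n kube-system get ds first); the DaemonSets will recreate the deleted pods:

kubectl -n kube-system delete pod -l app=flannel
kubectl -n kube-system delete pod -l k8s-app=kube-proxy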

Now, in this specific case, we're not suffering from an outage: the cluster is freshly deployed, so it's likely the SDN never worked at all - though this may not be obvious with a kubeadm deployment, which doesn't ship with any pods other than the defaults, most of which use host networking.

The kubeadm init command shows a pod CIDR of 192.168.5.0/24, which brings up two remarks:

  • with any SDN: the pod CIDR is a subnet that is split into smaller subnets (usually /24 or /25), each range being statically allocated to a Node when it first joins your cluster. A /24 pod CIDR therefore leaves room for only one node-sized allocation.

  • running the flannel SDN: kubeadm init should include a --pod-network-cidr argument that MUST match the subnet configured in the kube-flannel-cfg ConfigMap (the net-conf.json key), which defaults to 10.244.0.0/16. Both points can be verified with the commands sketched after this list.
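
A quick way to verify both points, assuming the ConfigMap lives in kube-system as in the default flannel manifest:

# Pod subnet allocated to each node (an empty value means nothing was allocated)
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
# Subnet flannel expects (net-conf.json key of the kube-flannel-cfg ConfigMap)
kubectl -n kube-system get configmap kube-flannel-cfg -o jsonpath='{.data.net-conf\.json}'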

Though I'm unfamiliar with the process of fixing this, there seems to be an answer on ServerFault that gives some instructions, which sound right: https://serverfault.com/a/977401/293779