One openshift-origin worker node won't resolve cluster.local records, causing ImagePullBackOff

Question

We have set up an OKD 3.11 cluster with 100+ nodes. Everything was working fine, but then one worker node stopped resolving the internal URL of the registry service. As a result, new pods scheduled to that node fail with an ImagePullBackOff error.

Failed to pull image "docker-registry.default.svc:5000/app-name/app-name:latest": rpc error: code = Unknown desc = Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 10.*.*.71:53: server misbehaving

We ran nslookup on the worker node, with the following results.

This lookup fails on the affected node (the same lookup works on other nodes):

[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local
Server:         10.*.*.71
Address:        10.*.*.71#53

** server can't find docker-registry.default.svc.cluster.local: SERVFAIL

Querying the local DNS at 127.0.0.1 directly works just fine:

[root@worker22 ~]# nslookup docker-registry.default.svc.cluster.local 127.0.0.1
Server:         127.0.0.1
Address:        127.0.0.1#53

Name:   docker-registry.default.svc.cluster.local
Address: 172.*.*.212
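
The comparison above can be scripted so it is easy to repeat on several nodes. This is only a minimal sketch: the function names classify_lookup and compare_resolvers are hypothetical, and the classification simply looks for the error strings and the "Name:" line that nslookup printed in the output shown above.

```shell
#!/bin/sh
# Classify nslookup output read from stdin.
# Prints "ok" if an answer was returned, "fail" otherwise.
# (Hypothetical helper; matches the nslookup output format shown above.)
classify_lookup() {
    out=$(cat)
    case "$out" in
        *SERVFAIL*|*NXDOMAIN*|*REFUSED*|*"can't find"*|*"timed out"*) echo fail ;;
        *Name:*) echo ok ;;
        *) echo fail ;;
    esac
}

# Resolve NAME once through the node's default resolver and once through
# a specific server (default 127.0.0.1), and report both results.
# Usage: compare_resolvers NAME [LOCAL_SERVER]
compare_resolvers() {
    name=$1
    local_server=${2:-127.0.0.1}
    default_result=$(nslookup "$name" 2>&1 | classify_lookup)
    local_result=$(nslookup "$name" "$local_server" 2>&1 | classify_lookup)
    echo "default resolver: $default_result"
    echo "$local_server: $local_result"
}
```

Running compare_resolvers docker-registry.default.svc.cluster.local on the affected node and on a healthy one should make the difference visible at a glance.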

Adding server=/cluster.local/172.30.0.1 to the dnsmasq configuration file /etc/dnsmasq.d/origin-upstream-dns.conf works as a workaround, but I can't find what is causing this.
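
For reference, the workaround can be applied idempotently with something like the following. This is a sketch rather than a supported procedure: the function name ensure_cluster_local_forwarder is hypothetical, and the file path and the 172.30.0.1 address are the ones from this node and may differ on other clusters.

```shell
#!/bin/sh
# Append the cluster.local forwarder line to a dnsmasq config file,
# unless the exact line is already there.
# Usage: ensure_cluster_local_forwarder [CONF_FILE] [DNS_IP]
# (Hypothetical helper; defaults are the values used in this question.)
ensure_cluster_local_forwarder() {
    conf=${1:-/etc/dnsmasq.d/origin-upstream-dns.conf}
    dns_ip=${2:-172.30.0.1}
    line="server=/cluster.local/$dns_ip"
    if grep -qxF "$line" "$conf" 2>/dev/null; then
        echo "already present: $line"
    else
        printf '%s\n' "$line" >> "$conf"
        echo "added: $line"
    fi
}

# On the affected node (as root), then restart dnsmasq to apply:
#   ensure_cluster_local_forwarder
#   systemctl restart dnsmasq
```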

I tried adding -q to the dnsmasq service's ExecStart, and the resulting query log shows that dnsmasq does not query the OpenShift DNS running locally at 127.0.0.1:53.
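
An alternative to editing ExecStart, not what was done here, is to enable the equivalent log-queries option from a config file. The sketch below assumes dnsmasq on the node reads /etc/dnsmasq.d, as it does for origin-upstream-dns.conf; the function name and the file name zz-debug-log-queries.conf are hypothetical.

```shell
#!/bin/sh
# Write a small dnsmasq config file that turns on query logging
# (log-queries is the long form of the -q command-line flag).
# Prints the path of the file it wrote.
# Usage: enable_dnsmasq_query_log [CONF_DIR]
enable_dnsmasq_query_log() {
    conf_dir=${1:-/etc/dnsmasq.d}
    conf_file="$conf_dir/zz-debug-log-queries.conf"   # hypothetical file name
    printf '%s\n' 'log-queries' > "$conf_file"
    echo "$conf_file"
}

# On the affected node (as root):
#   enable_dnsmasq_query_log
#   systemctl restart dnsmasq
#   journalctl -u dnsmasq -f    # watch which upstream each query is forwarded to
# Remove the file and restart dnsmasq again when done debugging.
```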

The dnsmasq configuration and resolv.conf are in order on the node.

I have tried restarting dnsmasq, NetworkManager, and Docker, and respawning the ovs/sdn pods, but nothing helped.

srv_ER · Accepted Answer · 2021-02-20T12:10:52

Found some documented evidence that dnsmasq can behave like that.

Some Red Hat articles suggest that a long-running dnsmasq service may misbehave and stop resolving names. Similar cases have been reported for OpenShift environments as well.

The links below suggest that restarting the service solves the problem for a while, after which the issue may resurface. As stated earlier, restarting the service didn't help in my case, but the oldest remedy in IT did: rebooting the node solved the problem.

Reference:

https://access.redhat.com/solutions/3393141

https://bugzilla.redhat.com/show_bug.cgi?id=1560489
