
We use Kubernetes CronJobs on GKE (version 1.9) to run several periodic tasks. From the pods, we need to make several calls to external APIs outside our network. Often (but not always), these calls fail because of DNS resolution timeouts.

My current hypothesis is that the upstream DNS server for the service we are trying to contact is rate-limiting us, because we make lots of repeated DNS requests: either the TTL on those records is too low, or the entries get dropped from the dnsmasq cache because the cache size is too small.

I tried editing the kube-dns deployment to change the cache-size and TTL arguments passed to the dnsmasq container, but the changes get reverted because the deployment is managed by GKE. Is there a way to persist these changes so that GKE does not overwrite them? Any other ideas for dealing with DNS issues on GKE, or on Kubernetes in general?
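For reference, the dnsmasq arguments I tried to change look roughly like this in the kube-dns deployment. The flag values below are illustrative, not the GKE defaults, and the exact container layout varies by version:

```yaml
# Fragment of the kube-dns Deployment in the kube-system namespace.
# --cache-size and --min-cache-ttl are the knobs I wanted to change;
# the values shown are examples, not what GKE ships with.
containers:
- name: dnsmasq
  args:
  - -v=2
  - --logtostderr
  - --
  - -k
  - --cache-size=10000   # larger cache -> fewer repeated upstream queries
  - --min-cache-ttl=60   # keep entries at least 60s even if the record TTL is lower
  - --no-negcache
```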

There is no way to keep customized changes to kube-dns, since it is managed by the master node; the master will just keep reverting them. I would suggest checking whether the kube-dns pods have been restarted, and also checking the events by running `kubectl get events --namespace=kube-system` to see if there are any issues with the kube-dns pod(s). – Jason

2 Answers


I'm not sure every knob is covered, but you should be able to reconfigure kube-dns on GKE by updating the ConfigMap the deployment reads from; it will use the ConfigMap when deploying new instances. Then delete the existing pods so they are recreated with the new config.
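A rough sketch of the steps, assuming the ConfigMap is named `kube-dns` in the `kube-system` namespace (check your cluster; names and labels can differ by version):

```shell
# Edit the kube-dns ConfigMap in place (opens your $EDITOR).
kubectl -n kube-system edit configmap kube-dns

# Delete the current kube-dns pods; the deployment recreates them
# with the updated config. The label selector is an assumption --
# verify it with: kubectl -n kube-system get pods --show-labels
kubectl -n kube-system delete pods -l k8s-app=kube-dns
```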


I suggest using ExternalDNS. Like kube-dns, it retrieves a list of resources (Services, Ingresses, etc.) from the Kubernetes API to determine a desired list of DNS records.