
I have a Kubernetes cluster running on AWS EKS (version 1.16); my application pods and CoreDNS pods run as DaemonSets on the cluster. Everything seems to work fine in all conditions except scale-down: while a node is scaling down, the application gets "mysqli::__construct(): php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution" errors.

The DNS resolution error comes from all pods, not just one. I point this out because if the error came from only one pod, I could argue that during scale-down the CoreDNS pod is shut down earlier than the application pod on the same node, so that application can't resolve the DB hostname. But DNS requests go to the kube-dns Service first and are then distributed across the CoreDNS pods, so that explanation doesn't hold.
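
For context, every pod's resolver points at the kube-dns ClusterIP Service, which load-balances across all ready CoreDNS endpoints. A trimmed sketch of what that Service typically looks like (the clusterIP value is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  type: ClusterIP
  clusterIP: 10.100.0.10      # illustrative; this is the IP in every pod's /etc/resolv.conf
  selector:
    k8s-app: kube-dns         # matches the CoreDNS pods
  ports:
    - name: dns
      port: 53
      protocol: UDP
    - name: dns-tcp
      port: 53
      protocol: TCP

As long as at least one CoreDNS endpoint behind this Service is ready, lookups should still succeed, which is why losing a single CoreDNS pod shouldn't cause cluster-wide failures.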

On the other hand, I can't find any logical explanation for this behavior. Is it possible that my cluster-autoscaler configuration is wrong?

My cluster-autoscaler config is below:

labels:
  app: cluster-autoscaler
spec:
  containers:
    - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --scan-interval=120s
        - --max-empty-bulk-delete=1
        - --scale-down-delay-after-delete=10m
        - --scale-down-unneeded-time=14m
        - --skip-nodes-with-local-storage=false
        - --scale-down-utilization-threshold=0.85
        - --skip-nodes-with-system-pods=false
        - --nodes=8:16:nodegroup-1
        - --nodes=3:10:nodegroup-2
      env:
        - name: AWS_REGION
          value: eu-west-1
      image: gcr.io/google-containers/cluster-autoscaler:v1.16.4
      imagePullPolicy: Always
      name: cluster-autoscaler
      resources:
        limits:
          cpu: 100m
          memory: 300Mi
        requests:
          cpu: 100m
          memory: 300Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /etc/ssl/certs/ca-certificates.crt
          name: ssl-certs
          readOnly: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cluster-autoscaler
  serviceAccountName: cluster-autoscaler
  terminationGracePeriodSeconds: 30
  volumes:
    - hostPath:
        path: /etc/ssl/certs/ca-bundle.crt
        type: ""
      name: ssl-certs

1 Answer


Try setting dnsPolicy: Default on the cluster-autoscaler pod so that its name resolution does not go through kube-dns.
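
A minimal sketch of where that field goes in the cluster-autoscaler Deployment's pod template (only the relevant lines are shown; the rest of your spec stays as it is):

spec:
  template:
    spec:
      dnsPolicy: Default          # resolve via the node's /etc/resolv.conf instead of kube-dns
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: gcr.io/google-containers/cluster-autoscaler:v1.16.4

With Default, the pod inherits the node's resolvers (on EKS, the VPC DNS), so the autoscaler keeps resolving the AWS API endpoints even while CoreDNS pods are being evicted during scale-down.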

Remember that using dnsPolicy: ClusterFirst on pods that run on the master node might not work unless the kube-proxy pod also runs on the master (kube-proxy programs the Service VIP -> backend Pods routing), which isn't always the case (e.g. in GCE kube-up it doesn't).
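
To illustrate the difference, here is roughly what the pod's /etc/resolv.conf ends up containing under each policy (all addresses are illustrative):

# dnsPolicy: ClusterFirst - points at the kube-dns Service VIP,
# which only works where kube-proxy has programmed that VIP
nameserver 10.100.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local

# dnsPolicy: Default - inherits the node's resolvers (the VPC resolver on EKS)
nameserver 10.0.0.2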

Take a look: eks-autoscaler.