troubleshoot Google kubernetes load balancer unhealthy nodes

Question

I'm wondering what steps to take when troubleshooting why a Google load balancer would see Nodes within a cluster as unhealthy?

Using Google Kubernetes, I have a cluster with 3 nodes, all deployments are running readiness and liveness checks. All are reporting that they are healthy.

The load balancer is built from the helm nginx-ingress:

https://github.com/helm/charts/tree/master/stable/nginx-ingress

It's used as a single ingress for all the deployment applications within the cluster.

Visually scanning the ingress controllers logs:

kubectl logs <ingress-controller-name>

shows only the usual nginx output ... HTTP/1.1" 200 ... I can't see any health checks within these logs. Not sure if I should but nothing to suggest anything is unhealhty.

Running a describe against the ingress controller shows no events, but it does show a liveness and readiness check which I'm not too sure would actually pass:

Name:               umbrella-ingress-controller-****
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-multi-client-n1--2cpu-4ram-****/10.154.0.50
Start Time:         Fri, 15 Nov 2019 21:23:36 +0000
Labels:             app=ingress
                    component=controller
                    pod-template-hash=7c55db4f5c
                    release=umbrella
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ingress-controller
Status:             Running
IP:                 ****
Controlled By:      ReplicaSet/umbrella-ingress-controller-7c55db4f5c
Containers:
  ingress-controller:
    Container ID:  docker://****
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:****
    Ports:         80/TCP, 443/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=default/umbrella-ingress-default-backend
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=default/umbrella-ingress-controller
    State:          Running
      Started:      Fri, 15 Nov 2019 21:24:38 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       umbrella-ingress-controller-**** (v1:metadata.name)
      POD_NAMESPACE:  default (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from umbrella-ingress-token-**** (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  umbrella-ingress-token-2tnm9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  umbrella-ingress-token-****
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

However, using Googles console, I navigate to the load balancers details and can see the following:

Above 2 of the nodes seem to be having issues, albeit I can't find the issues.

At this point the load balancer is still serving traffic via the third, healthy node, however it will occasionally drop that and show me the following:

At this point no traffic gets past the load balancer, so all the applications on the nodes are unreachable.

Any help with where I should be looking to troubleshoot this would be great.

---- edit 17/11/19

Below is the nginx-ingress config passed via helm:

ingress:
  enabled: true
  rbac.create: true
  controller:
    service:
      externalTrafficPolicy: Local
      loadBalancerIP: ****
  configData:
    proxy-connect-timeout: "15"
    proxy-read-timeout: "600"
    proxy-send-timeout: "600"
    proxy-body-size: "100m"

are you using externalTrafficPolicy: local for the service? considering the whole process works, and you have 1/3 healthy nodes, this is the most likely culprit. — Patrick W
@PatrickW I am yes - I've edited the question to include it above - what do you think is wrong? Why would this cause unhealthy nodes to be reported? — GuyC
I also see i'm not using: controller.service.healthCheckNodePort which reading the docs my set up might require? "If controller.service.type is NodePort or LoadBalancer and controller.service.externalTrafficPolicy is set to Local, set this to the managed health-check port the kube-proxy will expose..." — GuyC
Since you're using nginx ingress controller, you do not need to use nodeport — Patrick W

Patrick W Patrick W · Accepted Answer · 2019-11-17T18:09:51

This is expected behavior. Using externalTrafficPolicy: local configures the service so that only nodes where a serving pod exists will accept traffic. What this means is that any node that does not have a serving pod that receives traffic to the service will drop the packet.

The GCP Network Loadbalancer is still sending traffic to each node to test the health. The health check will use the service NodePort. Any node that contains nginx loadbalancer pods will respond to the health check. Any node that does not have an nginx load balancer pod will drop the packet so the check fails.

This results in only certain nodes showing as healthy.

For the nginx ingress controller, I recommend using the default value of cluster instead of changing it to local.

troubleshoot Google kubernetes load balancer unhealthy nodes

1 Answers