What steps should I take to troubleshoot why a Google load balancer sees the nodes within a cluster as unhealthy?
Using Google Kubernetes Engine, I have a cluster with 3 nodes. All deployments are running liveness and readiness probes, and all report that they are healthy.
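For example, from kubectl's point of view everything reports ready:

kubectl get nodes                  # all 3 nodes show STATUS Ready
kubectl get pods --all-namespaces  # all pods show READY and STATUS Running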
The load balancer is created by the stable/nginx-ingress Helm chart:
https://github.com/helm/charts/tree/master/stable/nginx-ingress
It's used as the single ingress for all the deployed applications within the cluster.
Visually scanning the ingress controller's logs:
kubectl logs <ingress-controller-name>
shows only the usual nginx output ... HTTP/1.1" 200 ...
I can't see any health checks within these logs. I'm not sure whether I should expect to, but there's nothing to suggest anything is unhealthy.
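For what it's worth, Google's load balancer health checks come from the 35.191.0.0/16 and 130.211.0.0/22 ranges and send a GoogleHC user agent, so they can be grepped for explicitly; neither turns anything up:

kubectl logs <ingress-controller-name> | grep -i googlehc                # health-check user agent
kubectl logs <ingress-controller-name> | grep -E '35\.191\.|130\.211\.'  # health-check source IP ranges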
Running a describe against the ingress controller pod shows no events, but it does show liveness and readiness checks that I'm not sure would actually pass (I try them by hand after the output below):
Name:               umbrella-ingress-controller-****
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-multi-client-n1--2cpu-4ram-****/10.154.0.50
Start Time:         Fri, 15 Nov 2019 21:23:36 +0000
Labels:             app=ingress
                    component=controller
                    pod-template-hash=7c55db4f5c
                    release=umbrella
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ingress-controller
Status:             Running
IP:                 ****
Controlled By:      ReplicaSet/umbrella-ingress-controller-7c55db4f5c
Containers:
  ingress-controller:
    Container ID:   docker://****
    Image:          quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
    Image ID:       docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:****
    Ports:          80/TCP, 443/TCP
    Host Ports:     0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=default/umbrella-ingress-default-backend
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=default/umbrella-ingress-controller
    State:          Running
      Started:      Fri, 15 Nov 2019 21:24:38 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Liveness:     http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       umbrella-ingress-controller-**** (v1:metadata.name)
      POD_NAMESPACE:  default (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from umbrella-ingress-token-**** (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  umbrella-ingress-token-2tnm9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  umbrella-ingress-token-****
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
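Given Ready: True above I'd expect those probes to pass; they can also be tried by hand with a port-forward to the controller pod:

kubectl port-forward umbrella-ingress-controller-**** 10254:10254
curl -i http://localhost:10254/healthz   # expecting HTTP/1.1 200 OK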
However, using Google's console, I navigate to the load balancer's details and see that 2 of the 3 nodes are being reported as unhealthy, although I can't find anything that explains why.
At this point the load balancer is still serving traffic via the third, healthy node; however, it will occasionally drop that node as well, at which point no traffic gets past the load balancer and all the applications on the nodes are unreachable.
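For the Google side, the only inspection I've found beyond the console is gcloud (the target pool name comes from the console's load balancer details):

gcloud compute http-health-checks list                                        # health checks GKE created
gcloud compute target-pools list                                              # target pools backing the LB
gcloud compute target-pools get-health <target-pool-name> --region <region>   # per-node health, as the LB sees it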
Any help with where I should be looking to troubleshoot this would be great.
---- edit 17/11/19
Below are the nginx-ingress values passed via Helm:
ingress:
  enabled: true
  rbac.create: true
  controller:
    service:
      externalTrafficPolicy: Local
      loadBalancerIP: ****
    configData:
      proxy-connect-timeout: "15"
      proxy-read-timeout: "600"
      proxy-send-timeout: "600"
      proxy-body-size: "100m"
Comments:

"externalTrafficPolicy: Local for the service? Considering the whole process works, and you have 1/3 healthy nodes, this is the most likely culprit." - Patrick W

"controller.service.healthCheckNodePort, which, reading the docs, my set-up might require? 'If controller.service.type is NodePort or LoadBalancer and controller.service.externalTrafficPolicy is set to Local, set this to the managed health-check port the kube-proxy will expose...'" - GuyC
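Following up on those comments: my understanding is that with externalTrafficPolicy: Local, only nodes actually running a controller pod answer the load balancer's health check, so I plan to check where the controller pods have landed and which node port the check targets (service name assumed from the controller pod names):

kubectl get pods -l app=ingress,component=controller -o wide                           # which nodes host a controller pod?
kubectl get svc umbrella-ingress-controller -o jsonpath='{.spec.healthCheckNodePort}'  # port the LB health check probes

If the chart only deployed one controller replica, only that pod's node would pass the health check, which would match the 1/3 healthy nodes above.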