4
votes

I'm trying to set up gRPC load balancing with Ingress on GCP, and I used this example as a reference. The example demonstrates gRPC load balancing in two ways (one with an Envoy sidecar, the other with an HTTP mux handling both gRPC and the HTTP health check on the same Pod). However, the Envoy proxy example doesn't work for me.

What confuses me is that the Pods are running and healthy (confirmed with kubectl describe and kubectl logs),

$ kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
fe-deployment-757ffcbd57-4w446   2/2     Running   0          4m22s
fe-deployment-757ffcbd57-xrrm9   2/2     Running   0          4m22s


$ kubectl describe pod fe-deployment-757ffcbd57-4w446
Name:               fe-deployment-757ffcbd57-4w446
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc/10.128.0.64
Start Time:         Thu, 26 Sep 2019 16:15:18 +0900
Labels:             app=fe
                    pod-template-hash=757ffcbd57
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container fe-envoy; cpu request for container fe-container
Status:             Running
IP:                 10.56.1.29
Controlled By:      ReplicaSet/fe-deployment-757ffcbd57
Containers:
  fe-envoy:
    Container ID:  docker://b4789909494f7eeb8d3af66cb59168e009c582d412d8ca683a7f435559989421
    Image:         envoyproxy/envoy:latest
    Image ID:      docker-pullable://envoyproxy/envoy@sha256:9ef9c4fd6189fdb903929dc5aa0492a51d6783777de65e567382ac7d9a28106b
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/envoy
    Args:
      -c
      /data/config/envoy.yaml
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:19 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Liveness:     http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /data/certs from certs-volume (rw)
      /data/config from envoy-config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
  fe-container:
    Container ID:  docker://a533224d3ea8b5e4d5e268a616d73762b37df69f434342459f35caa8fac32dab
    Image:         salrashid123/grpc_only_backend
    Image ID:      docker-pullable://salrashid123/grpc_only_backend@sha256:ebfac594116445dd67aff7c9e7a619d73222b60947e46ef65ee6d918db3e1f4b
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      /grpc_server
    Args:
      --grpcport
      :50051
      --insecure
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:20 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  certs-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fe-secret
    Optional:    false
  envoy-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      envoy-configmap
    Optional:  false
  default-token-c7nqc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c7nqc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                                          Message
  ----     ------     ----                   ----                                                          -------
  Normal   Scheduled  4m25s                  default-scheduler                                             Successfully assigned default/fe-deployment-757ffcbd57-4w446 to gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc
  Normal   Pulled     4m25s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Container image "envoyproxy/envoy:latest" already present on machine
  Normal   Created    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Created container
  Normal   Started    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Started container
  Normal   Pulling    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  pulling image "salrashid123/grpc_only_backend"
  Normal   Pulled     4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Successfully pulled image "salrashid123/grpc_only_backend"
  Normal   Created    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Created container
  Normal   Started    4m23s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Started container
  Warning  Unhealthy  4m10s (x2 over 4m20s)  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  4m9s (x2 over 4m19s)   kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Liveness probe failed: HTTP probe failed with statuscode: 503


$ kubectl describe pod fe-deployment-757ffcbd57-xrrm9
Name:               fe-deployment-757ffcbd57-xrrm9
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9/10.128.0.22
Start Time:         Thu, 26 Sep 2019 16:15:18 +0900
Labels:             app=fe
                    pod-template-hash=757ffcbd57
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container fe-envoy; cpu request for container fe-container
Status:             Running
IP:                 10.56.0.23
Controlled By:      ReplicaSet/fe-deployment-757ffcbd57
Containers:
  fe-envoy:
    Container ID:  docker://255dd6cab1e681e30ccfe158f7d72540576788dbf6be60b703982a7ecbb310b1
    Image:         envoyproxy/envoy:latest
    Image ID:      docker-pullable://envoyproxy/envoy@sha256:9ef9c4fd6189fdb903929dc5aa0492a51d6783777de65e567382ac7d9a28106b
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/envoy
    Args:
      -c
      /data/config/envoy.yaml
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:19 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Liveness:     http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /data/certs from certs-volume (rw)
      /data/config from envoy-config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
  fe-container:
    Container ID:  docker://f6a0246129cc89da846c473daaa1c1770d2b5419b6015098b0d4f35782b0a9da
    Image:         salrashid123/grpc_only_backend
    Image ID:      docker-pullable://salrashid123/grpc_only_backend@sha256:ebfac594116445dd67aff7c9e7a619d73222b60947e46ef65ee6d918db3e1f4b
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      /grpc_server
    Args:
      --grpcport
      :50051
      --insecure
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:20 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  certs-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fe-secret
    Optional:    false
  envoy-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      envoy-configmap
    Optional:  false
  default-token-c7nqc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c7nqc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                                                          Message
  ----     ------     ----                  ----                                                          -------
  Normal   Scheduled  5m8s                  default-scheduler                                             Successfully assigned default/fe-deployment-757ffcbd57-xrrm9 to gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9
  Normal   Pulled     5m8s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Container image "envoyproxy/envoy:latest" already present on machine
  Normal   Created    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Created container
  Normal   Started    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Started container
  Normal   Pulling    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  pulling image "salrashid123/grpc_only_backend"
  Normal   Pulled     5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Successfully pulled image "salrashid123/grpc_only_backend"
  Normal   Created    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Created container
  Normal   Started    5m6s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Started container
  Warning  Unhealthy  4m53s (x2 over 5m3s)  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  4m52s (x2 over 5m2s)  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Liveness probe failed: HTTP probe failed with statuscode: 503


$ kubectl get services
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)           AGE
fe-srv-ingress   NodePort       10.123.5.165   <none>         8080:30816/TCP    6m43s
fe-srv-lb        LoadBalancer   10.123.15.36   35.224.69.60   50051:30592/TCP   6m42s
kubernetes       ClusterIP      10.123.0.1     <none>         443/TCP           2d2h


$ kubectl describe service fe-srv-ingress
Name:                     fe-srv-ingress
Namespace:                default
Labels:                   type=fe-srv
Annotations:              cloud.google.com/neg: {"ingress": true}
                          cloud.google.com/neg-status:
                            {"network_endpoint_groups":{"8080":"k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2"},"zones":["us-central1-a"]}
                          kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"cloud.google.com/neg":"{\"ingress\": true}","service.alpha.kubernetes.io/a...
                          service.alpha.kubernetes.io/app-protocols: {"fe":"HTTP2"}
Selector:                 app=fe
Type:                     NodePort
IP:                       10.123.5.165
Port:                     fe  8080/TCP
TargetPort:               8080/TCP
NodePort:                 fe  30816/TCP
Endpoints:                10.56.0.23:8080,10.56.1.29:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason  Age    From            Message
  ----    ------  ----   ----            -------
  Normal  Create  6m47s  neg-controller  Created NEG "k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2" for default/fe-srv-ingress-8080/8080 in "us-central1-a".
  Normal  Attach  6m40s  neg-controller  Attach 2 network endpoint(s) (NEG "k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2" in zone "us-central1-a")

but the NEG reports them as unhealthy (so the Ingress also reports the backend as unhealthy).
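
For reference, the NEG endpoints and backend health can also be inspected from gcloud (the NEG name comes from the neg-status annotation above; the backend service name is whatever the Ingress controller generated, so list the backend services first):

$ gcloud compute network-endpoint-groups list-network-endpoints \
    k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2 --zone us-central1-a

$ gcloud compute backend-services list
$ gcloud compute backend-services get-health <generated-backend-service-name> --global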

I couldn't find what causes this. Does anyone know how to solve it?

Test environment:

  1. GKE, 1.13.7-gke.8 (VPC enabled)
  2. Default HTTP(S) load balancer on Ingress

The YAML files I used (the same as in the example mentioned above):

envoy-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-configmap
  labels:
    app: fe
data:
  config: |-
    ---
    admin:
      access_log_path: /dev/null
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 9000
    node:
      cluster: service_greeter
      id: test-id
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address: { address: 0.0.0.0, port_value: 8080 }
        filter_chains:
        - filters:
          - name: envoy.http_connection_manager
            config:
              stat_prefix: ingress_http
              codec_type: AUTO
              route_config:
                name: local_route
                virtual_hosts:
                - name: local_service
                  domains: ["*"]
                  routes:
                  - match:
                      path: "/echo.EchoServer/SayHello"
                    route: { cluster: local_grpc_endpoint  }
              http_filters:
              - name: envoy.lua
                config:
                  inline_code: |
                    package.path = "/etc/envoy/lua/?.lua;/usr/share/lua/5.1/nginx/?.lua;/etc/envoy/lua/" .. package.path
                    function envoy_on_request(request_handle)

                      if request_handle:headers():get(":path") == "/_ah/health" then
                        local headers, body = request_handle:httpCall(
                        "local_admin",
                        {
                          [":method"] = "GET",
                          [":path"] = "/clusters",
                          [":authority"] = "local_admin"
                        },"", 50)


                        str = "local_grpc_endpoint::127.0.0.1:50051::health_flags::healthy"
                        if string.match(body, str) then
                          request_handle:respond({[":status"] = "200"},"ok")
                        else
                          request_handle:logWarn("Envoy healthcheck failed")     
                          request_handle:respond({[":status"] = "503"},"unavailable")
                        end
                      end
                    end              
              - name: envoy.router
                typed_config: {}
          tls_context:
            common_tls_context:
              tls_certificates:
                - certificate_chain:
                    filename: "/data/certs/tls.crt"
                  private_key:
                    filename: "/data/certs/tls.key"
      clusters:
      - name: local_grpc_endpoint
        connect_timeout: 0.05s
        type:  STATIC
        http2_protocol_options: {}
        lb_policy: ROUND_ROBIN
        common_lb_config:
          healthy_panic_threshold:
            value: 50.0   
        health_checks:
          - timeout: 1s
            interval: 5s
            interval_jitter: 1s
            no_traffic_interval: 5s
            unhealthy_threshold: 1
            healthy_threshold: 3
            grpc_health_check:
              service_name: "echo.EchoServer"
              authority: "server.domain.com"
        hosts:
        - socket_address:
            address: 127.0.0.1
            port_value: 50051
      - name: local_admin
        connect_timeout: 0.05s
        type:  STATIC
        lb_policy: ROUND_ROBIN
        hosts:
        - socket_address:
            address: 127.0.0.1
            port_value: 9000
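
The Lua filter above answers /_ah/health by reading the Envoy admin interface's /clusters output, so its view of the local gRPC backend can be checked directly (pod name taken from the kubectl output above; this assumes curl is available in the Envoy image):

$ kubectl exec fe-deployment-757ffcbd57-4w446 -c fe-envoy -- \
    curl -s http://127.0.0.1:9000/clusters | grep health_flags
# The Lua check returns 200 only if the line
# "local_grpc_endpoint::127.0.0.1:50051::health_flags::healthy" is present.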

fe-deployment.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: fe-deployment
  labels:
    app: fe
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: fe
    spec:
      containers:

      - name: fe-envoy
        image: envoyproxy/envoy:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /_ah/health
            scheme: HTTPS
            port: fe
        readinessProbe:
          httpGet:
            path: /_ah/health
            scheme: HTTPS
            port: fe
        ports:
        - name: fe
          containerPort: 8080
          protocol: TCP               
        command: ["/usr/local/bin/envoy"]
        args: ["-c", "/data/config/envoy.yaml"]
        volumeMounts:
        - name: certs-volume
          mountPath: /data/certs
        - name: envoy-config-volume
          mountPath: /data/config

      - name: fe-container
        image: salrashid123/grpc_only_backend  # Runs a gRPC server (secure or insecure) on the port given by --grpcport (:50051). Port 50051 is also exposed in the Dockerfile.
        imagePullPolicy: Always         
        ports:
        - containerPort: 50051
          protocol: TCP                 
        command: ["/grpc_server"]
        args: ["--grpcport", ":50051", "--insecure"]

      volumes:
        - name: certs-volume
          secret:
            secretName: fe-secret
        - name: envoy-config-volume
          configMap:
             name: envoy-configmap
             items:
              - key: config
                path: envoy.yaml
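
The liveness and readiness probes above hit /_ah/health over HTTPS on port 8080 (the Envoy listener), so the same check the kubelet performs can be reproduced by hand (pod name from the output above; again assuming curl is present in the Envoy image):

$ kubectl exec fe-deployment-757ffcbd57-4w446 -c fe-envoy -- \
    curl -sk -o /dev/null -w '%{http_code}\n' https://localhost:8080/_ah/health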

fe-srv-ingress.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: fe-srv-ingress
  labels:
    type: fe-srv
  annotations:
    service.alpha.kubernetes.io/app-protocols: '{"fe":"HTTP2"}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: NodePort 
  ports:
  - name: fe
    port: 8080
    protocol: TCP
    targetPort: 8080       
  selector:
    app: fe

fe-ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: fe-ingress
  annotations:
    kubernetes.io/ingress.allow-http: "false"
spec:
  tls:
  - hosts:
    - server.domain.com
    secretName: fe-secret
  rules:
  - host: server.domain.com  
    http:
      paths:
      - path: /echo.EchoServer/*
        backend:
          serviceName: fe-srv-ingress
          servicePort: 8080
Hi @isbee, would you please show the results of kubectl describe for each pod? Thanks. - Yasen
@Yasen Done :) I also tried to upload images showing the NEG status (on the GCP console), but I have low reputation so I'm not allowed to. - isbee

2 Answers

3
votes

I had to allow traffic from the IP ranges listed as the health check source in the documentation (130.211.0.0/22 and 35.191.0.0/16), as described here: https://cloud.google.com/kubernetes-engine/docs/how-to/standalone-neg. I had to allow it both for the default network and for the new (regional) network the cluster lives in. Once I added these firewall rules, the health checks could reach the Pods exposed in the NEG used as a regional backend within a backend service of our HTTP(S) load balancer.

There may be a more restrictive firewall setup possible, but I just cut corners and allowed everything from the IP ranges declared as the health check source on the page referenced above.
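
A minimal version of that firewall rule could look like the following (the rule name is illustrative, the port matches the NEG endpoint port 8080, and the network should be the one the cluster lives in):

$ gcloud compute firewall-rules create allow-lb-health-checks \
    --network <cluster-network> \
    --direction INGRESS \
    --action ALLOW \
    --rules tcp:8080 \
    --source-ranges 130.211.0.0/22,35.191.0.0/16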

2
votes

A GCP committer says this is a kind of bug, so there is no way to fix it at this time.

The related issue is this one, and a pull request is now in progress.