GKE NEG Readiness Gate Failing with Windows Containers and Readiness Probe

Question

I can't get a health check to succeed for a .NET app running in an IIS container when using Container Native Load Balancing (CNLB).

I have a Network Endpoint Group (NEG) created by an Ingress resource definition in GKE on a VPC-native cluster.

When I circumvent CNLB by either exposing the NodePort or making a service of type LoadBalancer, the site resolves without issue.

All the pod conditions from a describe look good: pod readiness

The network endpoints show up when running describe endpoints: ready addresses

This is the health check that is generated by the load balancer: GCP Health Check

When I hit these endpoints from other containers or VMs in the same VPC, /health.htm responds with a 200. Here's the response from a container in the same namespace, though I have also reproduced this from a Linux VM that is outside the cluster but in the same VPC: endpoint responds

But in spite of it all, the health check is reporting the pods in my NEG unhealthy: Unhealthy Endpoints

The Stackdriver logs confirm the requests are timing out, but I'm not sure why the endpoints respond to other instances and not to the LB: Stackdriver Health Check Log

And I confirmed that GKE created what looks like the correct firewall rule that should allow traffic from the LB to the pods: firewall

Here is the YAML I'm working with:

Deployment:

apiVersion: apps/v1                                                  
kind: Deployment                                                     
metadata:                                                            
  labels:                                                            
    app: subdomain.domain.tld                                       
  name: subdomain-domain-tld                                       
  namespace: subdomain-domain-tld
spec:                                                                
  replicas: 3                                                        
  selector:                                                          
    matchLabels:                                                     
      app: subdomain.domain.tld                                     
  template:                                                          
    metadata:                                                        
      labels:                                                        
        app: subdomain.domain.tld
    spec:                                                            
      containers:                                                    
      - image: gcr.io/ourrepo/ourimage
        name: subdomain-domain-tld
        ports:                                                       
        - containerPort: 80                                          
        readinessProbe:                                              
          httpGet:                                                   
            path: /health.htm                                        
            port: 80                                                 
          initialDelaySeconds: 60                                    
          periodSeconds: 60                                          
          timeoutSeconds: 10                                         
        volumeMounts:                                                
        - mountPath: C:\some-secrets                                      
          name: some-secrets
      nodeSelector:                                                  
        kubernetes.io/os: windows                                    
      volumes:                                                       
      - name: some-secrets                                    
        secret:                                                      
          secretName: some-secrets

Service:

apiVersion: v1                                                       
kind: Service                                                        
metadata:                                                            
  labels:                                                            
    app: subdomain.domain.tld                                     
  name: subdomain-domain-tld-service
  namespace: subdomain-domain-tld
spec:                                                                
  ports:                                                             
  - port: 80                                                         
    targetPort: 80                                                   
  selector:                                                          
    app: subdomain.domain.tld                                       
  type: NodePort

The Ingress is extremely basic, as we have no real need for multiple routes on this site. However, I suspect whatever issue we're having is here.

apiVersion: extensions/v1beta1                                       
kind: Ingress                                                        
metadata:                                                            
  annotations:                                                       
    kubernetes.io/ingress.class: gce
  labels:                                                            
    app: subdomain.domain.tld                                       
  name: subdomain-domain-tld-ingress
  namespace: subdomain-domain-tld
spec:                                                                
  backend:                                                           
    serviceName: subdomain-domain-tld-service
    servicePort: 80

One last, somewhat relevant detail: I tried the steps in this documentation and they worked, but it's not identical to my situation, as it uses neither Windows containers nor readiness probes: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#using-pod-readiness-feedback

Any suggestions would be greatly appreciated. I've spent two days stuck on this and I'm sure it's obvious but I just can't see the problem.

Is it possible to switch to a Linux container? If so, we can give you a solution. — Abdennour TOUMI
Are you allowing ingress/egress everywhere? All firewalls and Kubernetes network policies? Also allowing both on the cluster and to/from the load-balancer? — Rico
Unfortunately, I can't switch to a Linux container, as the app we're running is ASP.NET rather than .NET Core and we're unable to port it to .NET Core @AbdennourTOUMI — 210rain
@Rico Yes, the cluster it's on is used purely for looking into the feasibility of running our ASP.NET sites in GKE, so I haven't configured any network policies. I've allowed all traffic on all ports to any instance in my VPC from 35.191.0.0/16 and 130.211.0.0/22, which are the IP ranges Google load balancers send traffic from per the documentation on this page: cloud.google.com/load-balancing/docs/health-checks I can also confirm there are no other firewall rules that would take priority and deny the traffic. — 210rain
Must be some firewall rule somewhere. You can always check with GKE support. — Rico

210rain · Accepted Answer · 2020-07-28T19:35:47

Apparently it's not documented but this functionality doesn't work with Windows containers at the time of writing. I was able to get in touch with a GCP Engineer and they provided the following:

After further investigation, I have found that Windows containers using LoadBalancer service works but, Windows containers using Ingress with NEGS is a limitation so, I have opened an internal case for updating the public documentation [1].

Since, Ingress + NEG will not work (per the limitation), I suggest you to use any option you mentioned either exposing the NodePort or making a service of type LoadBalancer.
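
For reference, the LoadBalancer workaround might look something like the sketch below. It reuses the labels, namespace, and ports from the Service in the question; the Service name is a hypothetical placeholder, and this has not been verified against a Windows node pool.

```yaml
# Sketch only: a Service of type LoadBalancer that bypasses Ingress + NEG.
# Selector, namespace, and ports are copied from the question's Service.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: subdomain.domain.tld
  name: subdomain-domain-tld-lb   # hypothetical name, not from the question
  namespace: subdomain-domain-tld
spec:
  type: LoadBalancer              # replaces NodePort + Ingress
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: subdomain.domain.tld
```

With this in place, the Ingress resource is no longer needed for the site, and the Deployment's readiness probe can stay as it is.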
