
My health checks fail with the following setup.

nginx.conf

user                            root;
worker_processes                auto;

error_log                       /var/log/nginx/error.log warn;

events {
    worker_connections          1024;
}

http {
    server {
        listen                  80;
        server_name             subdomain.domain.com;
        auth_basic              "Restricted";
        auth_basic_user_file    /etc/nginx/.htpasswd;
    }
    server {
        listen                  80;
        auth_basic              off;
    }
    server {
        listen                  2222;
        auth_basic              off;
        location /healthz {
            return 200;
        }
    }
}

DOCKERFILE

FROM nginx:alpine
COPY index.html /usr/share/nginx/html/index.html
VOLUME /usr/share/nginx/html
COPY /server/nginx.conf /etc/nginx/
COPY /server/htpasswd /etc/nginx/.htpasswd
CMD ["nginx", "-g", "daemon off;"]
EXPOSE 80
EXPOSE 2222

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: gcr.io/GOOGLE_CLOUD_PROJECT/my-app
          ports:
            - containerPort: 80
            - containerPort: 2222
          livenessProbe:
            httpGet:
              path: /healthz
              port: 2222
          readinessProbe:
            httpGet:
              path: /healthz
              port: 2222

It definitely works when I delete the "server_name" line in nginx.conf and delete the second server block. Could this be an issue with the ingress/load balancer, since I do not know how long it takes to update? (Yesterday I saw a healthy pod go unhealthy after a few minutes.) I am running this on Google Kubernetes Engine (GKE) with Google's own ingress controller (not the NGINX ingress controller!).

What am I doing wrong?

Can you show the logs of the failure? Does it fail immediately or only after a while? (Andy Shinn)
In addition to the logs, we need to ensure there is a route to port 2222 when subdomain.domain.com/healthz is targeted. It works when you delete the line because it probably starts treating the target as localhost. Please post the service and ingress YAML so we can investigate further. (Will R.O.F.)

1 Answer

The issue was that the load balancer GKE creates for the ingress runs its own health checks. By default these request the path / and expect an HTTP 200 in return. Only when the pod's readiness probe declares a different path does the load balancer health check pick up that path.
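To illustrate why the default check fails against a setup like this, here is a small Python sketch (not anything GKE actually runs): a local stand-in server requires basic auth on every path except /healthz, similar to the auth_basic setup in the question, and `probe` counts the target as healthy only on an HTTP 200, as described above. The credentials, the stand-in server and all helper names are hypothetical.

```python
import base64
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials standing in for an entry in .htpasswd.
EXPECTED_AUTH = "Basic " + base64.b64encode(b"user:secret").decode()


class AuthExceptHealthz(BaseHTTPRequestHandler):
    """Stand-in for nginx: basic auth on every path except /healthz."""

    def do_GET(self):
        if self.path == "/healthz":
            # Like `location /healthz { auth_basic off; return 200; }`
            self.send_response(200)
        elif self.headers.get("Authorization") == EXPECTED_AUTH:
            self.send_response(200)
        else:
            # Like `auth_basic "Restricted";` rejecting an anonymous request
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Restricted"')
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging


def probe(port, path, timeout=2.0):
    """Healthy only if an anonymous GET of `path` returns exactly 200."""
    url = f"http://127.0.0.1:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 401, 404, 5xx, ...
    except OSError:
        return False  # connection refused, timeout, ...


def run_probes(paths):
    """Start the stand-in server on a free local port and probe each path."""
    server = HTTPServer(("127.0.0.1", 0), AuthExceptHealthz)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return {path: probe(port, path) for path in paths}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_probes(["/", "/healthz"]))
```

Running it should print {'/': False, '/healthz': True}: an anonymous request for / gets a 401, which a check that expects 200 treats as unhealthy, while the auth-free /healthz passes.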

The load balancer is provisioned when the ingress YAML is applied. Later changes to the deployment or ingress that would affect the load balancer's health check are not picked up while that load balancer exists. This meant I had to delete the load balancer first and then re-apply the deployment, service and ingress YAMLs (the ingress then sets up a new load balancer automatically). Instead of deleting the load balancer, you can also enter the correct path manually in the load balancer's health check settings (and wait a few minutes).

Since the load balancer seems to run health checks against each open port, I removed port 2222 and added a location /healthz with auth_basic off to every server block listening on port 80.

See: https://cloud.google.com/load-balancing/docs/health-check-concepts and https://stackoverflow.com/a/61222826/2534357 and https://stackoverflow.com/a/38511357/2534357

New nginx.conf

user                            root;
worker_processes                auto;

error_log                       /var/log/nginx/error.log warn;

events {
    worker_connections          1024;
}

http {
    server {
        listen                  80;
        server_name             subdomain1.domain.com;
        root                    /usr/share/nginx/html;
        index                   index.html;
        auth_basic              "Restricted";
        auth_basic_user_file    /etc/nginx/.htpasswd_subdomain1;
        location /healthz {
            auth_basic          off;
            allow               all;
            return              200;
        }
    }
    server {
        listen                  80;
        server_name             subdomain2.domain.com;
        root                    /usr/share/nginx/html;
        index                   index.html;
        auth_basic              "Restricted";
        auth_basic_user_file    /etc/nginx/.htpasswd_subdomain2;
        location /healthz {
            auth_basic          off;
            allow               all;
            return              200;
        }
    }
    server {
        listen                  80;
        server_name             domain.com www.domain.com;
        root                    /usr/share/nginx/html;
        index                   index.html;
        auth_basic              "Restricted";
        auth_basic_user_file    /etc/nginx/.htpasswd_domain;
        location /healthz {
            auth_basic          off;
            allow               all;
            return              200;
        }
    }
    ## next block probably not necessary
    server {
        listen                  80;
        auth_basic              off;
        location /healthz {
            return              200;
        }
    }
}
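The "## next block probably not necessary" comment can be reasoned about with a toy Python model of how nginx chooses among several server blocks that share port 80: it compares the request's Host header with each server_name and, if nothing matches, falls back to the default server for that port, which is the first block listed unless one is marked default_server. This is a simplified illustration only; the function name and the pod IP are hypothetical, and what Host header the load balancer's probes actually send is an assumption not covered above.

```python
# Exact server_name sets of the port-80 blocks, in the order they appear
# in the new nginx.conf (the last block has no server_name at all).
PORT_80_BLOCKS = [
    {"subdomain1.domain.com"},
    {"subdomain2.domain.com"},
    {"domain.com", "www.domain.com"},
    set(),
]


def select_block(host_header, blocks=PORT_80_BLOCKS):
    """Return the index of the server block nginx would pick for a request.

    Simplified model: only exact names are matched (real nginx also
    supports wildcard and regex names), and the port is stripped only
    from hostname or IPv4 Host headers.
    """
    hostname = host_header.split(":", 1)[0].lower()
    for index, names in enumerate(blocks):
        if hostname in names:
            return index
    # No name matched: fall back to the default server for this port,
    # i.e. the first block listed (no block here uses `default_server`).
    return 0


if __name__ == "__main__":
    # "10.8.0.12" is a hypothetical pod IP, standing in for a probe that
    # does not send one of the configured host names.
    for host in ["www.domain.com", "subdomain2.domain.com", "10.8.0.12:80"]:
        print(host, "->", select_block(host))
```

Under this model a probe addressed by IP lands on the subdomain1 block (index 0), which already serves /healthz without auth, so the unnamed last block would never be selected; keeping /healthz in every named block still covers probes that do send one of the host names.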

New deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: gcr.io/GOOGLE_CLOUD_PROJECT/my-app
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /healthz
              port: 80
          readinessProbe:
            httpGet:
              path: /healthz
              port: 80