6 votes

I have a gRPC server that works fine on my local machine. I can send gRPC requests from a Python app and get back the right responses.

I put the server into a GKE cluster (with only one node), with a normal TCP load balancer in front of the cluster. In this setup my local client was able to get the correct response for some requests, but not for others. I think it was the gRPC streaming that didn't work.

I assumed that this was because streaming requires an HTTP/2 connection, which requires SSL.

The standard load balancer I got in GKE didn't seem to support SSL, so I followed the docs to set up an ingress load balancer, which does. I'm using a Let's Encrypt certificate with it.

Now all gRPC requests return

status = StatusCode.UNAVAILABLE

details = "Socket closed"

debug_error_string = "{"created":"@1556172211.931158414","description":"Error received from peer ipv4:ip.of.ingress.service:443", "file":"src/core/lib/surface/call.cc", "file_line":1041,"grpc_message":"Socket closed","grpc_status":14}"

The IP address is the external IP address of my ingress service. The ingress YAML looks like this:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: rev79-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "rev79-ip"
    ingress.gcp.kubernetes.io/pre-shared-cert: "lets-encrypt-rev79"
    kubernetes.io/ingress.allow-http: "false" # disable HTTP
spec:
  rules:
  - host: sub-domain.domain.app
    http:
      paths:
      - path: /*
        backend:
          serviceName: sandbox-nodes
          servicePort: 60000

The subdomain and domain of the request from my Python app match the host in the ingress rule.

It connects to a NodePort service that looks like this:

apiVersion: v1
kind: Service
metadata:
  name: sandbox-nodes
spec:
  type: NodePort
  selector:
    app: rev79
    environment: sandbox
  ports:
  - protocol: TCP
    port: 60000
    targetPort: 9000

The pod itself has two containers, and its deployment looks like this:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: rev79-sandbox
  labels:
    app: rev79
    environment: sandbox
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: rev79
        environment: sandbox
    spec:
      containers:
      - name: esp
        image: gcr.io/endpoints-release/endpoints-runtime:1.31
        args: [
          "--http2_port=9000",
          "--service=rev79.endpoints.rev79-232812.cloud.goog",
          "--rollout_strategy=managed",
          "--backend=grpc://0.0.0.0:3011"
        ]
        ports:
        - containerPort: 9000
      - name: rev79-uac-sandbox
        image: gcr.io/rev79-232812/uac:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 3011
        env:
        - name: RAILS_MASTER_KEY
          valueFrom:
            secretKeyRef:
              name: rev79-secrets
              key: rails-master-key

The target of the NodePort is the ESP container, which connects to the gRPC service configuration deployed in the cloud and to the backend, which is a Rails app that implements the API. This Rails app isn't running the Rails server, but a specialised gRPC server that comes with the grpc_for_rails gem.

The gRPC server in the Rails app doesn't record any activity in the logs, so I don't think the requests get that far.

kubectl get ingress reports this:

NAME            HOSTS                   ADDRESS            PORTS   AGE
rev79-ingress   sub-domain.domain.app   my.static.ip.addr   80      7h

showing port 80, even though it's set up with SSL. That seems to be a bug. When I check with curl -kv https://sub-domain.domain.app the ingress server handles the request fine, and uses HTTP/2. It returns an HTML-formatted server error, but I'm not sure what generates that.

The API requires an API key, which the Python client inserts into the metadata of each request.
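
For reference, the API key requirement comes from the Endpoints service configuration that the ESP's --service flag points at. A rough sketch of that kind of config is below; the title and the API name under apis are illustrative placeholders, not my actual values, but the usage rule is what makes ESP demand an API key:

type: google.api.Service
config_version: 3
name: rev79.endpoints.rev79-232812.cloud.goog
title: Rev79 API                 # placeholder title
apis:
- name: rev79.SandboxService     # placeholder: the fully-qualified gRPC service name from the .proto
usage:
  rules:
  # allow_unregistered_calls: false means callers must supply a valid API key
  - selector: "*"
    allow_unregistered_calls: false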

When I go to the Endpoints page of my GCP console I see that the API has not registered any requests since I put in the ingress load balancer, so it looks like the requests are not reaching the ESP container.

So why am I getting "socket closed" errors with gRPC?

You can add these environment variables on your Python client to get more detailed debug information: GRPC_VERBOSITY=DEBUG and GRPC_TRACE=api,channel,call_error,connectivity_state,http,server_channel. If you add that to your question, it might help in answering it. Do you use any kind of proxy on your side? (Corporate proxies usually don't support gRPC/HTTP2.) – Thomas
I've discovered my health checks are failing. Now I'm having problems getting functional health checks for my gRPC backend, and I'm asking a separate question on SO about that. I'll post an answer here once I've actually got it working. – Toby 1 Kenobi

1 Answer

1 vote

I said I would come back and post an answer here once I got it working. It looks like I never did. Being a man of my word, I'll now post the config files that are working for me.

In my deployment I've added liveness and readiness probes for the ESP container. This made deployments happen smoothly, without downtime:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: rev79-sandbox
  labels:
    app: rev79
    environment: sandbox
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: rev79
        environment: sandbox
    spec:
      volumes:
      - name: nginx-ssl
        secret:
          secretName: nginx-ssl
      - name: gcs-creds
        secret:
          secretName: rev79-secrets
          items:
            - key: gcs-credentials
              path: "gcs.json"
      containers:
      - name: esp
        image: gcr.io/endpoints-release/endpoints-runtime:1.45
        args: [
          "--http_port", "8080",
          "--ssl_port", "443",
          "--service", "rev79-sandbox.endpoints.rev79-232812.cloud.goog",
          "--rollout_strategy", "managed",
          "--backend", "grpc://0.0.0.0:3011",
          "--cors_preset", "cors_with_regex",
          "--cors_allow_origin_regex", ".*",
          "-z", " "
        ]
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 60
          timeoutSeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          timeoutSeconds: 5
          failureThreshold: 1
        volumeMounts:
        - name: nginx-ssl
          mountPath: /etc/nginx/ssl
          readOnly: true
        ports:
        - containerPort: 8080
        - containerPort: 443
          protocol: TCP
      - name: rev79-uac-sandbox
        image: gcr.io/rev79-232812/uac:29eff5e
        imagePullPolicy: Always
        volumeMounts:
          - name: gcs-creds
            mountPath: "/app/creds"
        ports:
        - containerPort: 3011
          name: end-grpc
        - containerPort: 3000
        env:
        - name: RAILS_MASTER_KEY
          valueFrom:
            secretKeyRef:
              name: rev79-secrets
              key: rails-master-key

This is my service config that exposes the deployment to the load balancer:

apiVersion: v1
kind: Service
metadata:
  name: rev79-srv-ingress-sandbox
  labels:
    type: rev79-srv
  annotations:
    service.alpha.kubernetes.io/app-protocols: '{"rev79":"HTTP2"}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: NodePort 
  ports:
  - name: rev79
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: rev79
    environment: sandbox

And this is my ingress:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: rev79-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "rev79-global-ip"
spec:
  tls:
  - secretName: sandbox-api-rev79-app-tls
  rules:
  - host: sandbox-api.rev79.app
    http:
      paths:
      - backend:
          serviceName: rev79-srv-ingress-sandbox
          servicePort: 443

I'm using cert-manager to manage the certificates.
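
The sandbox-api-rev79-app-tls secret referenced in the ingress tls block is created by cert-manager; a Certificate resource along the lines of the sketch below is roughly what that looks like. The issuer name letsencrypt-prod is a placeholder for whatever Issuer/ClusterIssuer is configured, and the apiVersion depends on the cert-manager version:

apiVersion: cert-manager.io/v1   # older releases use certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: sandbox-api-rev79-app-tls
spec:
  secretName: sandbox-api-rev79-app-tls   # must match the secretName in the ingress tls section
  dnsNames:
  - sandbox-api.rev79.app
  issuerRef:
    name: letsencrypt-prod   # placeholder for your actual issuer
    kind: ClusterIssuer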

It was a long time ago now. I can't remember if there was anything else I did to solve the issue I was having.