0
votes

Our GCP Load Balancer periodically returns 502 for some requests with "failed_to_connect_to_backend". While googling and searching Stack Overflow I found this link: https://cloud.google.com/load-balancing/docs/https#timeouts_and_retries. I also went through several articles about keep-alive timeouts on the GCP load balancer.

My servers in Kubernetes are running at low CPU usage, so the backend being too busy doesn't seem to be the issue.

Here is the sample code I use to set up the HTTP server:

    server := &http.Server{
        Addr:              addr,
        Handler:           handler,
        ReadHeaderTimeout: 20 * time.Second,
        ReadTimeout:       1 * time.Minute,
        WriteTimeout:      2 * time.Minute,
        // Keep-alive (idle) timeout above the load balancer's 600 s, per the GCP docs.
        IdleTimeout:       time.Duration(tcpKeepAliveTimeout) * time.Second,
    }
    if e := listenAndServe(server, 620); e != nil && e != http.ErrServerClosed {
        return e
    }
    func listenAndServe(srv *http.Server, tcpKeepAliveTimeout int) error {
        addr := srv.Addr
        if addr == "" {
            addr = ":http"
        }
        // Use the passed-in keep-alive value so the TCP keep-alive
        // matches the HTTP IdleTimeout configured above.
        lc := net.ListenConfig{
            KeepAlive: time.Duration(tcpKeepAliveTimeout) * time.Second,
        }

        ln, err := lc.Listen(context.Background(), "tcp", addr)
        if err != nil {
            return err
        }
        defer ln.Close()

        return srv.Serve(ln)
    }

I am setting a 620-second TCP keep-alive timeout (as recommended in the Google documentation), but it doesn't help and I am still getting 502s. What am I doing wrong?

2
Why are you specifying 620 for the second parameter to listenAndServe? This is a handler and not a timeout/keepalive. golang.org/pkg/net/http/#ListenAndServe – John Hanley
I redefined listenAndServe from the standard library in my code in order to create my own net.ListenConfig and override the keep-alive timeout. – Svyatoslav Grigoryev

2 Answers

0
votes

A 502 HTTP response code is generated when a GFE (Google Front End) is not able to establish a connection to a backend instance.

Common reasons for 502s are the following; I recommend you verify these on your end (a quick connectivity probe is sketched after the list):

  • Firewall (either GCP firewall rules or firewall software running on the instance itself) blocking traffic
  • Web server software not running on the backend instance
  • Web server software misconfigured on the backend instance
  • Server resources exhausted and not accepting connections:
      • CPU usage too high to respond
      • Memory usage too high, process killed or can't malloc()
      • Maximum established TCP connections reached
      • Maximum number of workers spawned and all are busy (think mpm_prefork in Apache)
  • Poorly written server implementation struggling under load, or non-standard behavior
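
As a rough way to check the first two items from inside the VPC, a probe along these lines shows whether anything is listening on the backend port at all. This is only a sketch; the 10.128.0.5:8080 address and the /healthz path are hypothetical placeholders for your own backend and health-check path.

    package main

    import (
        "fmt"
        "net"
        "net/http"
        "time"
    )

    func main() {
        // Hypothetical backend instance address; replace with your own.
        backend := "10.128.0.5:8080"

        // TCP-level check: does anything accept the connection at all?
        conn, err := net.DialTimeout("tcp", backend, 3*time.Second)
        if err != nil {
            fmt.Println("TCP connect failed:", err)
            return
        }
        conn.Close()

        // HTTP-level check: does the server answer on the (hypothetical) health-check path?
        client := &http.Client{Timeout: 5 * time.Second}
        resp, err := client.Get("http://" + backend + "/healthz")
        if err != nil {
            fmt.Println("HTTP request failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("HTTP status:", resp.Status)
    }

If the TCP dial already fails, look at firewall rules or whether the server process is running at all; if the dial succeeds but the HTTP request fails, the server configuration is the more likely culprit.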
0
votes

Here is what I found:

  1. I am running preemptible nodes on GKE.
  2. I have a script which deletes all containers when the GKE node is preempted.
  3. I am exposing my services via NodePort.
  4. When a node is preempted, the backend still routes traffic to the NodePort even after the node is deleted, until the backend health check fails.
  5. The solution was to move from NodePort to NEG endpoints with container-native load balancing (a sketch of the Service change is below).
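
For point 5, a minimal sketch of what the Service change can look like, assuming a VPC-native GKE cluster and the GKE Ingress path; the service name, selector, and ports are placeholders for your own workload:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app                                   # placeholder name
      annotations:
        cloud.google.com/neg: '{"ingress": true}'    # ask GKE to create NEGs for this Service
    spec:
      type: ClusterIP          # no NodePort needed once the LB targets pod NEGs directly
      selector:
        app: my-app            # placeholder selector
      ports:
        - port: 80
          targetPort: 8080     # placeholder container port

With NEG endpoints the load balancer health-checks the pods directly, so traffic stops flowing to a preempted node as soon as its pods are gone instead of waiting for the NodePort health check to fail.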