0
votes

Our GCP Load Balancer periodically returns 502 for some requests with "failed_to_connect_to_backend". While googling and searching Stack Overflow I found this link: https://cloud.google.com/load-balancing/docs/https#timeouts_and_retries. I also went through several articles about keep-alive timeouts on the GCP load balancer.

My servers in Kubernetes are running at low CPU usage, so the backend being too busy doesn't seem to be the issue.

Here is the sample code I use to set up the HTTP server:

    server := &http.Server{
        Addr:              addr,
        Handler:           handler,
        ReadHeaderTimeout: 20 * time.Second,
        ReadTimeout:       1 * time.Minute,
        WriteTimeout:      2 * time.Minute,
        // Keep-alive (idle) timeout above the load balancer's 600 s, per the GCP docs.
        IdleTimeout:       time.Duration(tcpKeepAliveTimeout) * time.Second,
    }
    if e := listenAndServe(server, 620); e != nil && e != http.ErrServerClosed {
        return e
    }
    func listenAndServe(srv *http.Server, tcpKeepAliveTimeout int) error {
        addr := srv.Addr
        if addr == "" {
            addr = ":http"
        }
        // Use the passed-in keep-alive value so the TCP keep-alive
        // matches the HTTP IdleTimeout configured above.
        lc := net.ListenConfig{
            KeepAlive: time.Duration(tcpKeepAliveTimeout) * time.Second,
        }

        ln, err := lc.Listen(context.Background(), "tcp", addr)
        if err != nil {
            return err
        }
        defer ln.Close()

        return srv.Serve(ln)
    }

I am setting a 620-second TCP keep-alive timeout (as recommended in the Google documentation), but it doesn't help and I am still getting 502s. What am I doing wrong?

2
Why are you specifying 620 for the second parameter to listenAndServe? This is a handler and not a timeout/keepalive. golang.org/pkg/net/http/#ListenAndServe – John Hanley
I redefined listenAndServe from the standard library in my code in order to create my own net.ListenConfig and override the keep-alive timeout. – Svyatoslav Grigoryev

2 Answers

0
votes

A 502 HTTP response code is generated when a GFE (Google Front End) is not able to establish a connection to a backend instance.

Common reasons for 502s are the following; I recommend you verify these on your end (a quick connectivity probe is sketched after the list):

  • Firewall (either GCP firewall rules or firewall software running on the instance itself) blocking traffic
  • Web server software not running on the backend instance
  • Web server software misconfigured on the backend instance
  • Server resources exhausted and not accepting connections:
      • CPU usage too high to respond
      • Memory usage too high, process killed or can't malloc()
      • Maximum established TCP connections reached
      • Maximum number of workers spawned and all are busy (think mpm_prefork in Apache)
  • Poorly written server implementation struggling under load, or non-standard behavior
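
As a rough way to check the first two items from inside the VPC, a probe along these lines shows whether anything is listening on the backend port at all. This is only a sketch; the 10.128.0.5:8080 address and the /healthz path are hypothetical placeholders for your own backend and health-check path.

    package main

    import (
        "fmt"
        "net"
        "net/http"
        "time"
    )

    func main() {
        // Hypothetical backend instance address; replace with your own.
        backend := "10.128.0.5:8080"

        // TCP-level check: does anything accept the connection at all?
        conn, err := net.DialTimeout("tcp", backend, 3*time.Second)
        if err != nil {
            fmt.Println("TCP connect failed:", err)
            return
        }
        conn.Close()

        // HTTP-level check: does the server answer on the (hypothetical) health-check path?
        client := &http.Client{Timeout: 5 * time.Second}
        resp, err := client.Get("http://" + backend + "/healthz")
        if err != nil {
            fmt.Println("HTTP request failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("HTTP status:", resp.Status)
    }

If the TCP dial already fails, look at firewall rules or whether the server process is running at all; if the dial succeeds but the HTTP request fails, the server configuration is the more likely culprit.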
0
votes

Here is what I found:

  1. I am running preemptible nodes on GKE.
  2. I have a script which deletes all containers when the GKE node is preempted.
  3. I am exposing my services via NodePort.
  4. When a node is preempted, the backend still routes traffic to the NodePort even after the node is deleted, until the backend health check fails.
  5. The solution was to move from NodePort to NEG endpoints with container-native load balancing (a sketch of the Service change is below).
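
For point 5, a minimal sketch of what the Service change can look like, assuming a VPC-native GKE cluster and the GKE Ingress path; the service name, selector, and ports are placeholders for your own workload:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app                                   # placeholder name
      annotations:
        cloud.google.com/neg: '{"ingress": true}'    # ask GKE to create NEGs for this Service
    spec:
      type: ClusterIP          # no NodePort needed once the LB targets pod NEGs directly
      selector:
        app: my-app            # placeholder selector
      ports:
        - port: 80
          targetPort: 8080     # placeholder container port

With NEG endpoints the load balancer health-checks the pods directly, so traffic stops flowing to a preempted node as soon as its pods are gone instead of waiting for the NodePort health check to fail.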