1
votes

TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?


Details: I have a relatively standard setup: GCE instances in a managed instance group, global HTTPS load balancer in front of them pointed at a backend service with only the one managed instance group in it. Health checks are standard 5 seconds timeout, 5 seconds unhealthy threshold, 2 consecutive failures, 2 consecutive successes.

I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (10-15 min usually), I delete the old instances.

Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, which can be seen only in the load-balancer level logs:

I've done a bunch of logs correlation and tcpdumping and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:

What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?

2
Enabling Connection Draining cloud.google.com/compute/docs/load-balancing/…Dagang

2 Answers

2
votes

I think what you are looking for is the connection draining feature. https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining

0
votes

To answer my own question: it turns out that these 502s were not related to shutting down an instance, 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between nginx timeouts and GCP's HTTP(S) Load Balancer timeouts—I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer