
I have a sandbox gke cluster, with some services and some internal load balancers.

Services are mostly defined like this:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: my-app
  name: my-app
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
    cloud.google.com/load-balancer-type: "Internal"
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: my-app
  sessionAffinity: None
  type: LoadBalancer

But about twice a week someone reports that the endpoint has stopped working. When I go investigate, the load balancer has no instance groups attached anymore.

The only "weird" things we do are scaling all our apps' pods down to 0 replicas outside business hours and using preemptible instances in the node pool... I thought it could be related to the first, but I just forced a scale-down on some services and their load balancers are still fine.
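For context, the out-of-hours scale-down is roughly something like this (the deployment name, replica counts, and namespace here are hypothetical examples, not our exact setup):

```shell
# Hypothetical sketch of the nightly scale-down (names are examples).
# Scale the app's deployment to 0 replicas outside business hours...
kubectl scale deployment my-app --replicas=0 --namespace=sandbox

# ...and back up in the morning.
kubectl scale deployment my-app --replicas=2 --namespace=sandbox
```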

It may be related to the preemptible instances, though: if all the pods end up on one instance (especially the kube-system pods), they all go down at once when that node is reclaimed, and it seems like the cluster can't recover properly from that.

Another weird thing I see happening is the k8s-ig--foobar instance group ending up with 0 instances.
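When this happens, this is roughly how I check the instance group and the load balancer backends from the GCP side (the instance group name, zone, and service name below are examples from my cluster, not universal values):

```shell
# List instances in the GKE-managed instance group (comes back empty when the bug hits).
gcloud compute instance-groups list-instances k8s-ig--foobar --zone=us-central1-a

# Show backend services and which instance groups they point at.
gcloud compute backend-services list

# Double-check the Service from the Kubernetes side too.
kubectl describe service my-app
```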

Has anyone experienced something like this? I couldn't find any docs about this.

What does kubectl describe service say? Any interesting annotations or logs? – Yves Junqueira
@YvesJunqueira no, everything looks fine. I also tried looking at the logs of the kube-system services; nothing interesting there either (and literally nothing in most of them)... – caarlos0
My best guess: when you bring the instances down, the load balancer keeps trying to find healthy instances. When that fails, it probably uses exponential backoff to avoid retrying too often. If you keep the instances down for long enough, the retry rate may drop to something like once every 4 hours, which obviously isn't enough when you bring the instances back. To test whether this is right, try bringing the replicas up from 0 every 30 minutes or so, then down again, just to reset the backoff. I'm not suggesting this as a fix, but as a way to debug GKE. :-) – Yves Junqueira
That makes sense! But per my test just now, it seems that because the preemptible nodes go down after 24 hours (and, by pure bad luck, all of mine at the same time), GKE deletes the k8s-ig. When new nodes come up, the k8s-ig is recreated, but not re-attached to the LBs. Seems like a bug to me... – caarlos0
I'll try to report this to them. Thanks for the help so far, @YvesJunqueira =D – caarlos0

1 Answer


I opened a bug and it was marked as "could not reproduce".

But changing the node pool from preemptible to "normal" instances does "fix" the problem.
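For anyone hitting the same thing, the switch was roughly the following: create a non-preemptible node pool, drain the preemptible nodes so the pods move over, then delete the old pool. The pool and cluster names here are hypothetical, not from my actual setup:

```shell
# Create a new node pool WITHOUT the --preemptible flag (names are examples).
gcloud container node-pools create normal-pool --cluster=sandbox-cluster --num-nodes=3

# Cordon and drain every node in the old preemptible pool so pods reschedule
# onto the new pool; GKE labels nodes with their pool name.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=preemptible-pool -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

# Remove the old preemptible pool.
gcloud container node-pools delete preemptible-pool --cluster=sandbox-cluster
```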