Given is a cluster with rather static workloads that are deployed to one fixed-size node pool (default). An additional node pool holds elastic workloads; its size scales between 0 and ~10 instances. During scaling, the cluster is unresponsive most of the time:
- I can't access some cluster pages in the GKE console, e.g. the Workloads page (screenshot, interface in German): https://i.stack.imgur.com/MSd3Y.png
- kubectl can't connect, and existing connections such as `port-forward` or `kubectl get pods -w` drop:

  ```
  E0828 12:36:14.495621 10818 portforward.go:233] lost connection to pod
  The connection to the server 35.205.157.182 was refused - did you specify the right host or port?
  ```
- Also, I think dependent tools like prom-operator run into issues, as even very basic metrics like `kube_pod_container_info` are missing data during that time.
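To quantify how long the API server is actually unreachable during a scale event, one could run a rough polling loop like the sketch below (assumes a standard kubectl setup pointed at the cluster; the flags used are stock kubectl):

```shell
# Poll the API server health endpoint once per second during a scale
# event to measure how long it is unreachable.
while true; do
  date +%T
  kubectl get --raw /healthz --request-timeout=2s || echo "API unreachable"
  sleep 1
done
```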
What I have tried so far is switching from a regional to a zonal cluster (no single-node master?), but that didn't help. Also, the issue does not occur on every scale of the node pool, but in most cases.
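For reference, here is a hedged sketch of the gcloud commands one might use to check whether the control plane itself is being resized or repaired while the node pool scales (cluster name and zone are placeholders):

```shell
# List recent cluster operations; during a node-pool resize GKE may also
# resize or repair the master, which would explain the API server
# becoming unreachable. Zone is a placeholder.
gcloud container operations list \
  --zone=europe-west1-b \
  --filter="operationType:(UPGRADE_MASTER OR REPAIR_CLUSTER OR SET_NODE_POOL_SIZE)"

# Check the cluster status; it reads RECONCILING while the control
# plane is being updated. "my-cluster" is a placeholder name.
gcloud container clusters describe my-cluster \
  --zone=europe-west1-b --format="value(status)"
```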
So my question is: how do I debug/fix this?