I am running a Kubernetes cluster with 2 worker nodes and approximately 100 deployments. Each deployment has 2 or 4 replicas, so each worker runs approximately 300 pods (yes, that is a lot of pods).
My problem: when a worker goes down, Kubernetes tries to reschedule every failed pod onto the remaining worker node. At the end of the operation I have:
- the remaining worker node running 600 pods
- a very high load average on the master nodes, because they are rescheduling 300 pods
- an empty worker node once the failed one comes back up, because all of the pods are now on the other worker node
The only solution I have found is to create 2 deployments for every application (one per worker), which prevents the rescheduling of 300 pods.
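To make the workaround concrete, here is a minimal sketch of what one of the two per-worker deployments could look like. The application name `my-app`, the node name `worker-1`, the image, and the extra `worker` label are all hypothetical placeholders, and I am assuming the pods are pinned with a `nodeSelector` on the standard `kubernetes.io/hostname` node label. A second, identical deployment would target `worker-2`.

```yaml
# Hypothetical example: one of the two Deployments for an application "my-app".
# All names, labels, and the image below are placeholders, not my real manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-worker-1
spec:
  # Half of the application's replicas (1 or 2) go to each worker.
  replicas: 1
  selector:
    matchLabels:
      app: my-app
      worker: worker-1
  template:
    metadata:
      labels:
        app: my-app
        worker: worker-1
    spec:
      # Pin these pods to a single node. If that node is down, the pods
      # stay Pending instead of being rescheduled onto the other worker.
      nodeSelector:
        kubernetes.io/hostname: worker-1
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0
```

With this layout the number of deployments doubles (about 200 instead of 100), which is the main reason I am looking for something better.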
Is there a better solution?