
I created an Azure AKS cluster with 3 nodes (Standard DS3 v2, 4 vCPUs, 14 GB memory). I was experimenting with the cluster and created a Deployment with 1000 replicas. After this, the complete cluster went down.

azureuser@saa:~$ k get cs
NAME                 STATUS      MESSAGE                                                                                        ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: getsockopt: connection refused   
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused   
etcd-0               Healthy     {"health": "true"}  
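To check the component statuses programmatically instead of reading the table by eye, here is a minimal Python sketch. It assumes you saved the JSON form of the same listing (kubectl get cs -o json) and that each item has the usual ComponentStatus shape, with metadata.name and a conditions list whose entries have type and status. The function name unhealthy_components is hypothetical. As the first comment below points out, AKS runs the control plane for you, so an Unhealthy row here does not necessarily mean the component is down.

```python
import json
from typing import List


def unhealthy_components(cs_json: str) -> List[str]:
    """Return the names of components whose "Healthy" condition is not "True".

    Assumes the input is the JSON printed by `kubectl get cs -o json`
    (a List of ComponentStatus objects).
    """
    doc = json.loads(cs_json)
    bad = []
    for item in doc.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        # A component counts as healthy only if it reports Healthy=True.
        healthy = any(
            cond.get("type") == "Healthy" and cond.get("status") == "True"
            for cond in item.get("conditions", [])
        )
        if not healthy:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Illustrative input mirroring the table above (messages omitted).
    sample = json.dumps({
        "kind": "List",
        "items": [
            {"metadata": {"name": "controller-manager"},
             "conditions": [{"type": "Healthy", "status": "False"}]},
            {"metadata": {"name": "scheduler"},
             "conditions": [{"type": "Healthy", "status": "False"}]},
            {"metadata": {"name": "etcd-0"},
             "conditions": [{"type": "Healthy", "status": "True"}]},
        ],
    })
    print(unhealthy_components(sample))  # prints ['controller-manager', 'scheduler']
```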

From debugging, it seems both the scheduler and the controller-manager went down. How do I fix this?

What exactly happened when I created a Deployment with 1000 replicas? Shouldn't Kubernetes handle this?

Output of a few debugging commands:

  kubectl cluster-info
    Kubernetes master is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443
    Heapster is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/heapster/proxy
    KubeDNS is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
    kubernetes-dashboard is running at https://cg-games-e5252212.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy

Logs from kubectl cluster-info dump: http://termbin.com/e6wb

azureuser@sim:~$ az aks scale -n cg -g cognitive-games -c 4 --verbose
Deployment failed. Correlation ID: 4df797b2-28bf-4c18-a26a-4e341xxxxx. Operation failed with status: 200. Details: Resource state Failed

No nodes are displayed:

azureuser@si:~$ k get nodes
No resources found
Hi, as you are using AKS, the Kubernetes master is managed by Azure. In the above scenario, the scheduler and controller-manager are not down; you can see the output only says "connection refused". - Suresh Vishnoi
You can get further information by running kubectl get events. - Suresh Vishnoi
@SureshVishnoi kubectl get events returns "No resources found". - StateLess
Hi, the cluster is up. We need the logs of the nodes or of the whole cluster to diagnose the issue. You can run kubectl cluster-info. - Suresh Vishnoi
@SureshVishnoi Updated with the required logs. - StateLess

1 Answer


It looks silly, but when an AKS cluster is created in a resource group (RG), two RGs are created: one containing the AKS resource, and another, named with a random hash, containing all the VMs. I had deleted the second RG, and the basic AKS cluster stopped working.
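To confirm whether this is what happened to a given cluster, you can compare the cluster's nodeResourceGroup property with the resource groups that still exist. The Python sketch below is a minimal, hedged example. It assumes you saved the JSON output of az aks show -g cognitive-games -n cg -o json and az group list -o json (the cluster and RG names are taken from the scale command above). The function name missing_node_resource_group is hypothetical. The comparison is case-insensitive because Azure resource group names are not case-sensitive.

```python
import json
from typing import Optional


def missing_node_resource_group(aks_show_json: str, group_list_json: str) -> Optional[str]:
    """Return the cluster's node resource group name if it no longer exists.

    aks_show_json:   JSON printed by `az aks show ... -o json` (assumed input)
    group_list_json: JSON printed by `az group list -o json` (assumed input)
    Returns None when the group still exists or the field is absent.
    """
    cluster = json.loads(aks_show_json)
    node_rg = cluster.get("nodeResourceGroup")
    if not node_rg:
        return None  # nothing to compare against
    # Resource group names are case-insensitive in Azure.
    existing = {group["name"].lower() for group in json.loads(group_list_json)}
    return None if node_rg.lower() in existing else node_rg


if __name__ == "__main__":
    # Hypothetical data: the node RG has been deleted, only the main RG is left.
    cluster_json = json.dumps({"name": "cg", "nodeResourceGroup": "hypothetical-node-rg"})
    groups_json = json.dumps([{"name": "cognitive-games"}])
    print(missing_node_resource_group(cluster_json, groups_json))  # prints hypothetical-node-rg
```

For a one-off check from the shell, az group exists --name followed by the node RG name prints true or false.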