
I have a kops cluster with a maximum of 75 nodes and the cluster autoscaler enabled. It uses kubenet networking. Things have currently stopped working, i.e. scale-down is no longer happening.

The cluster is running at max capacity, i.e. 75 nodes, even with almost no load. I'm not sure where to start troubleshooting the problem.

I see the following errors in the cluster autoscaler pod:

    I0222 01:45:14.327164       1 static_autoscaler.go:97] Starting main loop
    W0222 01:45:14.770818       1 static_autoscaler.go:150] Cluster is not ready for autoscaling
    I0222 01:45:15.043126       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
    I0222 01:45:17.121507       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
    I0222 01:45:19.126665       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
    I0222 01:45:21.327581       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
    I0222 01:45:23.331802       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
    I0222 01:45:24.775124       1 static_autoscaler.go:97] Starting main loop
    W0222 01:45:25.085442       1 static_autoscaler.go:150] Cluster is not ready for autoscaling

Autoscaling was working fine.
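
As a first check (a sketch, assuming the default status ConfigMap name and that the autoscaler runs in kube-system), the status the autoscaler publishes on every loop can be inspected:

    # Show the health summary the autoscaler writes on each iteration
    kubectl -n kube-system describe configmap cluster-autoscaler-status

    # List nodes and their Ready state to see how many the autoscaler considers usable
    kubectl get nodes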

Update: I also see the following errors when running kops validate cluster:

    VALIDATION ERRORS
    KIND    NAME                MESSAGE
    Node    ip-172-20-32-173.ec2.internal   node "ip-172-20-32-173.ec2.internal" is not ready
    ...

    I0221 22:16:02.688911    2403 node_conditions.go:60] node "ip-172-20-51-238.ec2.internal" not ready: &NodeCondition{Type:NetworkUnavailable,Status:True,LastHeartbeatTime:2019-02-21 22:15:56 -0500 EST,LastTransitionTime:2019-02-21 22:15:56 -0500 EST,Reason:NoRouteCreated,Message:RouteController failed to create a route,}
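
The NoRouteCreated reason suggests the route controller cannot add pod routes for new nodes. One way to confirm (a sketch; rtb-xxxxxxxx is a placeholder for the route table kops manages for this cluster):

    # Count the entries in the cluster's VPC route table
    aws ec2 describe-route-tables --route-table-ids rtb-xxxxxxxx \
        --query 'length(RouteTables[0].Routes)'

    # Nodes stuck with NetworkUnavailable=True show up as NotReady here
    kubectl get nodes -o wide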

Comment from Crou: Can you grep the log from cluster-autoscaler for "cannot be removed"? How are allocated resources looking on the nodes?
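
A minimal form of that check (assuming the autoscaler runs as a Deployment named cluster-autoscaler in kube-system; adjust to your deployment):

    # Find nodes the autoscaler refuses to scale down, and the reason
    kubectl -n kube-system logs deployment/cluster-autoscaler | grep "cannot be removed"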

1 Answer


I found out the problem was that my cluster had gone into an unhealthy state because of this limitation in AWS VPC routing tables. My cluster had scaled to 75 nodes, become unhealthy, and was then unable to scale back down.

From the link:

One important limitation when using kubenet networking is that an AWS routing table cannot have more than 50 entries, which sets a limit of 50 nodes per cluster.
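
One way to stay under that limit (a sketch; the instance group name "nodes", $CLUSTER_NAME, and the ASG name are placeholders for your setup) is to cap the node count at 50 in both kops and the autoscaler:

    # Cap the kops instance group (opens an editor; set spec.maxSize to 50)
    kops edit ig nodes --name $CLUSTER_NAME

    # Match the cluster-autoscaler range to the same cap, e.g. --nodes=1:50:<your-asg-name>,
    # then roll out the kops change
    kops update cluster --name $CLUSTER_NAME --yes

Alternatively, kops supports CNI networking providers that don't rely on VPC route table entries per node, which removes this particular ceiling.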