2
votes

I have a few clusters in my GCP project, each with node pools of 3 nodes, and auto-upgrade and auto-repair are enabled.

The auto-upgrade began approximately 3 days ago and is still running for GKE version 1.12.10-gke.17.

Now, as my clusters are opted in to auto-upgrade and auto-repair, some clusters are getting upgraded without issues while a few others are running the upgrade with errors.

On my first cluster, a few of my pods became unschedulable, and the possible actions suggested by GCP are to (see the command sketch after this list):

  • Enable autoscaling in one or more node pools that have autoscaling disabled.
  • Increase the size of one or more node pools manually.
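For reference, both of those actions map to single gcloud commands. A minimal sketch, assuming a cluster named my-cluster with a node pool default-pool in asia-south1-a (all placeholders):

    # Enable autoscaling on an existing node pool
    gcloud container clusters update my-cluster \
        --zone asia-south1-a \
        --node-pool default-pool \
        --enable-autoscaling --min-nodes 1 --max-nodes 5

    # ...or manually grow the pool
    gcloud container clusters resize my-cluster \
        --zone asia-south1-a \
        --node-pool default-pool \
        --num-nodes 4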

When I run "gcloud container clusters describe CLUSTER_NAME --zone ZONE",

I get the details of the cluster. However, under the nodePools section I see:

 status: RUNNING_WITH_ERROR
  statusMessage: 'asia-south1-a: Timed out waiting for cluster initialization; cluster
    API may not be available: k8sclient: 7 - 404 status code returned. Requested resource
    not found.'
  version: 1.12.10-gke.17
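As a side note, you can pull just the node-pool portion out of that output with a format projection (a sketch; CLUSTER_NAME and ZONE are placeholders):

    gcloud container clusters describe CLUSTER_NAME \
        --zone ZONE \
        --format="yaml(nodePools)"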

NOTE:

I also see that GCP suggests to

  • Enable autoscaling in one or more node pools that have autoscaling disabled.
  • Shrink one or more node pools manually.

because the resource requests are low.
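A quick way to check how much of each node's capacity is actually requested, assuming working kubectl access to the cluster:

    kubectl describe nodes | grep -A 8 "Allocated resources"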

Please let me know what other logs I can provide to resolve this issue.

Error Description and Activity

UPDATE:

We went through these logs, and Google Support believes that the kubelet might be failing to submit a Certificate Signing Request (CSR) or that it might have old, invalid credentials. To assist with the troubleshooting, they asked us to collect the following logs from an affected node (a sketch of collecting them from a node follows the list):

  1. sudo journalctl -u kubelet > kubelet.log
  2. sudo journalctl -u kube-node-installation > kube-node-installation.log
  3. sudo journalctl -u kube-node-configuration > kube-node-configuration.log
  4. sudo journalctl -u node-problem-detector > node-problem-detector.log
  5. sudo journalctl -u docker > docker.log
  6. sudo journalctl -u cloud-init > cloud-init.log
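These commands run on the node itself, so you first need to SSH into an affected node. A minimal sketch, with the node name and zone as placeholders:

    # List the nodes to find an affected one
    kubectl get nodes

    # SSH into the node and collect the logs
    gcloud compute ssh gke-my-cluster-default-pool-abcd1234-wxyz \
        --zone asia-south1-a
    sudo journalctl -u kubelet > kubelet.log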

Any node that starts running 1.13.12-gke.13 fails to connect to the master. Anything else that is happening to the nodes (e.g. recreation) is because the auto-repair loop is trying to fix them, and that doesn't seem to be causing additional problems.
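Given the CSR hypothesis above, one thing worth checking from a machine with working kubectl access is whether the kubelets' certificate signing requests are stuck in a Pending state (a sketch, not something Google Support explicitly asked for):

    # List certificate signing requests and their state
    kubectl get csr

    # Inspect one that looks stuck (name is a placeholder)
    kubectl describe csr node-csr-abc123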

Which versions are the master and nodes on? It looks like your master is on 1.12.10, which is no longer supported. Can you upgrade your master to a supported version such as 1.13.11-gke.14? - Patrick W
Also, are you having issues communicating with the master (using kubectl commands)? - Patrick W
If the master is stuck in a repair or upgrade status, Google NEEDS to take care of this. If the nodes are stuck upgrading, manually delete them; new ones should be created with the correct version. - Patrick W
@PatrickW Thanks for your reply. - The issue is with upgrading the master and the nodes to the newer versions, so no, I'm unable to do the upgrade. No, I'm not having issues with kubectl commands connecting to my pods on the nodes; I'm not sure whether any of the pods are on the master node. - I tried to delete and create new nodes, but it seems that isn't possible during an upgrade from the UI. Should I try to force-delete the nodes? If I should, is that safe for my data? - I have already raised a support ticket with Google and they are also looking into it. - Chronograph3r
Absolutely @mWatney. They are still troubleshooting with lots of logs. Are you also experiencing the same? - Chronograph3r
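For reference, the manual node deletion Patrick W describes usually looks like this: drain the node first, then delete the underlying VM so the instance group recreates it on the target version. A sketch with placeholder names:

    # Cordon and drain the stuck node
    kubectl drain gke-my-cluster-default-pool-abcd1234-wxyz \
        --ignore-daemonsets

    # Delete the backing Compute Engine instance; the managed
    # instance group recreates it
    gcloud compute instances delete gke-my-cluster-default-pool-abcd1234-wxyz \
        --zone asia-south1-a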

1 Answer

0
votes

This isn't exactly a solution, but it is a working fix. We were able to narrow it down to this.

On the node pools we had "node-restriction" labels specifying what type of nodes they should be.

Google Support also said that it is currently not possible to update the labels of an existing node pool once it has begun an upgrade, so they suggested creating a new node pool without any of these labels. If we were able to deploy that node pool successfully, we would then migrate our workloads to the newly created node pool.

So we created a new node pool without those two node-selector labels, and to our surprise it worked. We had to migrate the whole workload, though.
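Creating the replacement pool without the labels is a single command. A sketch with placeholder names (note the absence of any --node-labels flag):

    gcloud container node-pools create new-pool \
        --cluster my-cluster \
        --zone asia-south1-a \
        --num-nodes 3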

For the migration itself we followed this guide: Cloud Migration.
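The guide boils down to cordoning the old pool so nothing new schedules there, then draining each old node so the pods get rescheduled onto the new pool. A sketch, assuming the old pool is named default-pool:

    # Stop scheduling onto the old pool, then evict its pods
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o name); do
        kubectl cordon "$node"
    done

    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o name); do
        kubectl drain "$node" --ignore-daemonsets
    done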