
I am trying to provision a GKE cluster with a Windows node pool using the Google modules. I am calling this module:

  source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
  version = "9.2.0"

I had to define two pools: one Linux pool required by GKE and the Windows one we need. Terraform always succeeds in provisioning the Linux node pool, but it fails to provision the Windows one with this error message:

module.gke.google_container_cluster.primary: Still modifying... [id=projects/uk-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev, 24m31s elapsed]
module.gke.google_container_cluster.primary: Still modifying... [id=projects/uk-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev, 24m41s elapsed]
module.gke.google_container_cluster.primary: Still modifying... [id=projects/uk-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev, 24m51s elapsed]
module.gke.google_container_cluster.primary: Modifications complete after 24m58s [id=projects/xx-xxx-xx-xxx-b821/locations/europe-west2/clusters/gke-nonpci-dev]
module.gke.google_container_node_pool.pools["windows-node-pool"]: Creating...

Error: error creating NodePool: googleapi: Error 400: Workload Identity is not supported on Windows nodes. Create the nodepool without workload identity by specifying --workload-metadata=GCE_METADATA., badRequest

  on .terraform\modules\gke\terraform-google-kubernetes-engine-9.2.0\modules\beta-private-cluster-update-variant\cluster.tf line 341, in resource "google_container_node_pool" "pools":
 341: resource "google_container_node_pool" "pools" {
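
For context, our cluster has Workload Identity enabled, which I suspect is why new node pools default to the GKE metadata server that Windows nodes don't support. In this module, Workload Identity is switched on via identity_namespace; a minimal sketch of the relevant cluster-level setting (var.project_id is illustrative, not my actual variable name):

  module "gke" {
    source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
    version = "9.2.0"

    # With Workload Identity enabled cluster-wide, every new node pool
    # defaults to the GKE metadata server (GKE_METADATA_SERVER).
    identity_namespace = "${var.project_id}.svc.id.goog"
    # ...
  }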

I tried to set this metadata value in many places, but I couldn't get it right:

From the Terraform side:

I tried adding this metadata inside the node_config scope, both in the module itself and in my main.tf file where I call the module. I also tried adding it to the windows-node-pool entry of the node_pools list, but it wasn't accepted; the error said that setting workload identity isn't expected there.
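
As far as I can tell, workload metadata is not instance metadata: at the provider level (google-beta 3.x, which these module versions use) it is a dedicated workload_metadata_config block inside node_config, not an entry in the metadata map. A minimal sketch of the raw resource, just to show where the setting lives (names are illustrative, not my actual config):

  resource "google_container_node_pool" "windows" {
    provider = google-beta
    name     = "windows-node-pool"
    cluster  = google_container_cluster.primary.name
    location = "europe-west2"

    node_config {
      image_type   = "WINDOWS_SAC"
      machine_type = "n1-standard-2"

      # The setting the API error asks for: a dedicated block,
      # not a key in the metadata map.
      workload_metadata_config {
        node_metadata = "GCE_METADATA"
      }
    }
  }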

I also tried setting enable_shielded_nodes = false, but that didn't really help.

I then tested whether this is doable at all through the command line. These were my commands:

C:\>gcloud container node-pools --region europe-west2 list
NAME                    MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
default-node-pool-d916  n1-standard-2  100           1.17.9-gke.600

 
C:\>gcloud container node-pools --region europe-west2 create window-node-pool --cluster=gke-nonpci-dev --image-type=WINDOWS_SAC --no-enable-autoupgrade --machine-type=n1-standard-2
WARNING: Starting in 1.12, new node pools will be created with their legacy Compute Engine instance metadata APIs disabled by default. To create a node pool with legacy instance metadata endpoints disabled, run `node-pools create` with the flag `--metadata disable-legacy-endpoints=true`.
This will disable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
ERROR: (gcloud.container.node-pools.create) ResponseError: code=400, message=Workload Identity is not supported on Windows nodes. Create the nodepool without workload identity by specifying --workload-metadata=GCE_METADATA.

C:\>gcloud container node-pools --region europe-west2 create window-node-pool --cluster=gke-nonpci-dev --image-type=WINDOWS_SAC --no-enable-autoupgrade --machine-type=n1-standard-2 --workload-metadata=GCE_METADATA --metadata disable-legacy-endpoints=true
This will disable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
ERROR: (gcloud.container.node-pools.create) ResponseError: code=400, message=Service account "[email protected]" does not exist.

C:\>gcloud auth list
                       Credentialed Accounts
ACTIVE  ACCOUNT
*       [email protected]

The service account shown by gcloud auth list is the one I am running Terraform with, but I don't know where the one in the error message is coming from. Since creating the Windows node pool through the command line, as shown above, didn't work either, I am stuck and don't know what to do.
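
The only CLI workaround I can think of is to pass the node service account explicitly, since the error suggests gcloud falls back to the project's default Compute Engine service account (which apparently no longer exists in this project). Something like this (the service account here is a placeholder):

C:\>gcloud container node-pools --region europe-west2 create window-node-pool --cluster=gke-nonpci-dev --image-type=WINDOWS_SAC --no-enable-autoupgrade --machine-type=n1-standard-2 --workload-metadata=GCE_METADATA --metadata disable-legacy-endpoints=true --service-account=<node-sa>@<project>.iam.gserviceaccount.com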

Module version 9.2.0 has been stable for us across all the Linux-based clusters we set up before, so I wondered whether it might simply be too old for a Windows node pool. I tried version 11.0.0 instead to see if that would make any difference, but I ended up with a different error:

module.gke.google_container_node_pool.pools["default-node-pool"]: Refreshing state... [id=projects/uk-tix-p1-npe-b821/locations/europe-west2/clusters/gke-nonpci-dev/nodePools/default-node-pool-d916]

Error: failed to execute ".terraform/modules/gke.gcloud_delete_default_kube_dns_configmap/terraform-google-gcloud-1.4.1/scripts/check_env.sh": fork/exec .terraform/modules/gke.gcloud_delete_default_kube_dns_configmap/terraform-google-gcloud-1.4.1/scripts/check_env.sh: %1 is not a valid Win32 application.

  on .terraform\modules\gke.gcloud_delete_default_kube_dns_configmap\terraform-google-gcloud-1.4.1\main.tf line 70, in data "external" "env_override":
  70: data "external" "env_override" {

Error: failed to execute ".terraform/modules/gke.gcloud_wait_for_cluster/terraform-google-gcloud-1.3.0/scripts/check_env.sh": fork/exec .terraform/modules/gke.gcloud_wait_for_cluster/terraform-google-gcloud-1.3.0/scripts/check_env.sh: %1 is not a valid Win32 application.

  on .terraform\modules\gke.gcloud_wait_for_cluster\terraform-google-gcloud-1.3.0\main.tf line 70, in data "external" "env_override":
  70: data "external" "env_override" {
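
These two failures look unrelated to the Windows pool itself: they are the module's bash helper scripts being executed by Terraform on my Windows workstation, which cannot run them. If I read the module correctly, it exposes a skip_provisioners flag to bypass those local-exec scripts; a hedged sketch (I have not confirmed the variable exists in exactly this version):

  module "gke" {
    source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster-update-variant"
    version = "11.0.0"

    # Skip the bash-based local-exec provisioners that fail on Windows
    # hosts with "%1 is not a valid Win32 application".
    skip_provisioners = true
    # ...
  }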

This is how I set the node_pools parameters:


  node_pools = [
    {
      name               = "linux-node-pool"
      machine_type       = var.nodepool_instance_type
      min_count          = 1
      max_count          = 10
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = "COS"                                  
      auto_repair        = true                                   
      auto_upgrade       = true                                 
      service_account    = google_service_account.gke_cluster_sa.email
      preemptible        = var.preemptible
      initial_node_count = 1
    },
    {
      name               = "windows-node-pool"
      machine_type       = var.nodepool_instance_type
      min_count          = 1
      max_count          = 10
      disk_size_gb       = 100
      disk_type          = "pd-standard"
      image_type         = var.nodepool_image_type                
      auto_repair        = true                                   
      auto_upgrade       = true                                   
      service_account    = google_service_account.gke_cluster_sa.email
      preemptible        = var.preemptible
      initial_node_count = 1
    }
  ]

  cluster_resource_labels = var.cluster_resource_labels           

  # health check and webhook firewall rules
  node_pools_tags = {
    all = [
      "xx-xxx-xxx-local-xxx",
    ]
  }

  node_pools_metadata = {
    all = {
//      workload-metadata = "GCE_METADATA"
    }

    linux-node-pool = {
      ssh-keys = join("\n", [for user, key in var.node_ssh_keys : "${user}:${key}"])
      block-project-ssh-keys = true
    }

    windows-node-pool = {
      workload-metadata = "GCE_METADATA"
    }

  }

Note: this is a shared VPC where I provision my cluster, and the cluster version is 1.17.9-gke.600.

1 Answer


Check out https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/632 for the solution.

The error message is ambiguous, and GKE has an internal bug tracking this issue. We will improve the error message soon.
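
In short, the Windows pool has to be created with the plain GCE metadata endpoint instead of the GKE (Workload Identity) metadata server. In module versions that support it, this reportedly maps to a per-pool node_metadata key in the node_pools list; a hypothetical sketch (check the linked issue and your module version for the exact key name):

  node_pools = [
    {
      name       = "windows-node-pool"
      image_type = "WINDOWS_SAC"
      # Hypothetical per-pool key; maps to workload_metadata_config's
      # node_metadata = "GCE_METADATA" on the underlying node pool resource.
      node_metadata = "GCE_METADATA"
      # ...
    },
  ]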