1
votes

I am trying to create a google_container_node_pool with GPUs. I tried the machine types nvidia-tesla-p4 and a2-highgpu-1g, and each returns a different error:

projects/my-project-id/zones/us-central1-a/machineTypes/nvidia-tesla-p4

or

Error: error creating NodePool: googleapi: Error 403: Insufficient regional quota to satisfy request: resource "PREEMPTIBLE_NVIDIA_V100_GPUS": request requires '3.0' and is short '2.0'. project has a quota of '1.0' with '1.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=my-project-id., forbidden

When I check the quotas page, the relevant quota shows "All 99 quotas are within limit".

According to the error I need more quota, but it does not say which quota I should request.

Update:

Changing the machine_type to a2-highgpu-1g changed the error message to refer to a different quota, A2_CPUS. When I set preemptible to false, I get the same error for NVIDIA_A100_GPUS instead of PREEMPTIBLE_NVIDIA_V100_GPUS or A2_CPUS. The problem with both A2_CPUS and NVIDIA_A100_GPUS is that I can't request more quota: the checkbox in the UI is disabled and the limit is shown as "Unlimited".

2
It seems like there are already 2 answers to this question. If one of them solves your problem, please vote on it or accept it as detailed here. If you need more clarification, remember that you can also comment on the answers. – Judith Guzman
I gave up after a few attempts, mostly due to other pressing matters. I hope to get back to it in the next version; it may take a few weeks. – Johnathan Kanarek
Updated my question. – Johnathan Kanarek
Please see my updated answer. TL;DR you should request an increase of the REGIONAL quota, as zonal quota is not actionable. – Judith Guzman
Also, make sure you have enough CPU + A2 CPU quota in the region. – Judith Guzman

2 Answers

1
votes

You don't see an error in the Quotas page because there wasn't a violation of your quotas, since the nodes weren't created.

For example, if you want to create a node pool with 3 nodes that each have 1 V100 GPU, go to the Quotas page and request an increase of PREEMPTIBLE_NVIDIA_V100_GPUS from 1 to 3. Repeat with the relevant numbers for each GPU and zone.
Please note that you should wait until GCP approves your requests before trying to create the resources again in Terraform.

If you don't want to extend the quotas and only want to check your TF configuration, reduce the number of GPU nodes to one that stays within your quotas.
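
The arithmetic behind the error message can be sketched as follows. This is only an illustration, assuming every node carries the same number of GPUs; the function name gpu_quota_shortfall and its parameters are hypothetical and not part of any Google tooling.

```shell
# gpu_quota_shortfall NODES GPUS_PER_NODE AVAILABLE_QUOTA
# Hypothetical helper: prints how many additional GPUs of quota are needed
# (0 if the available quota already covers the request).
gpu_quota_shortfall() {
  nodes=$1
  gpus_per_node=$2
  available=$3
  required=$((nodes * gpus_per_node))
  short=$((required - available))
  if [ "$short" -lt 0 ]; then
    short=0
  fi
  echo "$short"
}

# Numbers from the error in the question: 3 GPUs required, 1 available.
gpu_quota_shortfall 3 1 1
```

With the values from the question this prints 2, which matches the "short '2.0'" part of the error.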

1
votes

The first message appears because GCP has no machine type named nvidia-tesla-p4; that name is a GPU accelerator type, not a machine type. This document has a comprehensive list of the available machine types, but make sure to use one that is available in the region and zone where you're spinning up your GKE cluster. You can check the machine types available in a zone with this command: gcloud compute machine-types list --filter="zone:( ZONE … )"
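
A minimal sketch of turning that check into a yes/no test, assuming an authenticated gcloud. The function name machine_type_exists is hypothetical, and the extra name= filter term and --format projection are additions to the command above.

```shell
# machine_type_exists ZONE MACHINE_TYPE
# Hypothetical helper: succeeds if gcloud lists MACHINE_TYPE in ZONE.
machine_type_exists() {
  zone=$1
  machine_type=$2
  # Restrict the listing to one zone and one name, printing only the name.
  found=$(gcloud compute machine-types list \
    --filter="zone:( $zone ) AND name=$machine_type" \
    --format="value(name)")
  [ "$found" = "$machine_type" ]
}

# Usage (zone and machine type taken from the question):
#   machine_type_exists us-central1-a a2-highgpu-1g && echo "available"
```

Run against nvidia-tesla-p4, the function should fail, since that name is not returned by the machine type listing.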

Regarding the second message, it is clear that you don't have enough quota for that specific GPU in that region. As @hilsenrat has mentioned, you can't see any quotas being exhausted as the cluster never got created in the first place.

As mentioned in the Availability section of the documentation on running GPUs in GKE:

GPUs are available in specific regions and zones. When you request GPU quota, consider the regions in which you intend to run your clusters.

For a complete list of applicable regions and zones, refer to GPUs on Compute Engine.

To see a list of all GPU accelerator types supported in each zone, run the following command: gcloud compute accelerator-types list --filter="zone:( ZONE )"

Because adding a GPU to a preemptible instance consumes your regular GPU quota, I would also make sure that the V100 quota in the REGION is sufficient. If you need a separate quota for preemptible GPUs, request a separate Preemptible GPU quota as described here.

I suggest going to the quota page and filtering these specific quotas, making sure you click on "ALL QUOTAS" under the Details column. Regional quotas will be displayed.

  • Service: Compute Engine API
  • Name: GPUs (all regions)
  • Quota Metric: compute.googleapis.com/gpus_all_regions
  • Limit Name: GPUS-ALL-REGIONS-per-project

  • Service: Compute Engine API
  • Name: NVIDIA V100 GPUs
  • Quota Metric: compute.googleapis.com/nvidia_v100_gpus
  • Limit Name: NVIDIA-V100-GPUS-per-project-zone/NVIDIA-V100-GPUS-per-project-region

  • Service: Compute Engine API
  • Name: Preemptible NVIDIA V100 GPUs
  • Quota Metric: compute.googleapis.com/preemptible_nvidia_v100_gpus
  • Limit Name: PREEMPTIBLE-NVIDIA-V100-GPUS-per-project-zone/PREEMPTIBLE-NVIDIA-V100-GPUS-per-project-region

Make sure you have enough GLOBAL AND REGIONAL quota for the specific GPU model you are trying to use. Preemptible GPUs need to be requested separately as mentioned here.
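
The same regional and global figures can also be read from the command line. The following is a rough sketch rather than the only way to do it: the function names are hypothetical, and the grep patterns assume that GPU quota metrics contain "GPUS" in their name, as PREEMPTIBLE_NVIDIA_V100_GPUS in the error message does.

```shell
# show_region_gpu_quotas REGION
# Hypothetical helper: prints metric, limit and usage (tab-separated)
# for every GPU-related quota in REGION.
show_region_gpu_quotas() {
  region=$1
  gcloud compute regions describe "$region" \
    --flatten="quotas[]" \
    --format="value(quotas.metric,quotas.limit,quotas.usage)" |
    grep "GPUS"
}

# show_global_gpu_quota
# Hypothetical helper: prints the project-wide "GPUs (all regions)" quota.
show_global_gpu_quota() {
  gcloud compute project-info describe \
    --flatten="quotas[]" \
    --format="value(quotas.metric,quotas.limit,quotas.usage)" |
    grep "GPUS_ALL_REGIONS"
}

# Usage (region taken from the zone in the question):
#   show_region_gpu_quotas us-central1
#   show_global_gpu_quota
```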

------UPDATE----

Also, please note that an increase can only be requested for regional quotas. Any zonal quota listed is dependent on the corresponding regional quota. In this capture, even though the zonal limits read "Unlimited", the regional quota is 0, so attempting to use GPUs anywhere in the region will fail. (As you can see, only the regional quota can be selected for editing.)

Regional vs Zonal GPU quota

You mention that now you get a message mentioning you don't have enough quota for A2 CPUs. Please make sure to have enough CPU quota in the Region AND enough A2 CPU quota as well. For this you have to consider the number of vCPUs required for the machine type you want to deploy.

Selecting regional A2 CPU quota
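
To estimate how much A2 CPU quota a node pool needs, the vCPU count of the machine type can be multiplied by the number of nodes. A minimal sketch, assuming an authenticated gcloud; the function name required_vcpus is hypothetical, and the node count in the usage line is only an example.

```shell
# required_vcpus ZONE MACHINE_TYPE NODES
# Hypothetical helper: prints the total vCPUs needed for NODES nodes
# of MACHINE_TYPE, using the guestCpus field reported by gcloud.
required_vcpus() {
  zone=$1
  machine_type=$2
  nodes=$3
  vcpus=$(gcloud compute machine-types describe "$machine_type" \
    --zone "$zone" \
    --format="value(guestCpus)")
  echo $((vcpus * nodes))
}

# Usage (3 nodes is an assumed pool size):
#   required_vcpus us-central1-a a2-highgpu-1g 3
```

The result should be compared with both the CPU quota and the A2 CPU quota available in the region.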

You can read more about working with CPU quotas here.

I hope this information is useful and clarifies your question.