1 vote

I'm trying to set up Kubernetes with NVIDIA GPU nodes/slaves. I followed the guide at https://docs.nvidia.com/datacenter/kubernetes-install-guide/index.html and was able to get the node to join the cluster. I then tried the kubeadm example pod below:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["8000"]

The pod fails to schedule, and kubectl get events shows:

4s          2m           14        gpu-pod.15487ec0ea0a1882        Pod                                          Warning   FailedScheduling        default-scheduler            0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 PodToleratesNodeTaints.

I'm using AWS EC2 instances: m5.large for the master node and g2.8xlarge for the slave node. Describing the node also shows "nvidia.com/gpu: 4". Can anybody tell me if I'm missing any steps or configuration?

Could you share the results of kubectl describe nodes for your NVIDIA workers, and also the results of kubectl describe pods gpu-pod and kubectl logs gpu-pod? The information you provided is not enough to understand what is happening. – Artem Golenyaev
@ArtemGolenyaev Adding a Google Docs link with the requested logs: link – Aditya Abinash
Looks like you don't have enough resources for scheduling the pod. Try decreasing the required nvidia.com/gpu-memory or expanding the resources of the NVIDIA GPU nodes. – Artem Golenyaev
The pod is requesting 8 GB of memory, whereas a g2.8xlarge instance has 60 GB. – Aditya Abinash

1 Answer

1 vote

According to the AWS G2 documentation, g2.8xlarge servers have the following resources:

  • Four NVIDIA GRID GPUs, each with 1,536 CUDA cores and 4 GB of video memory and the ability to encode either four real-time HD video streams at 1080p or eight real-time HD video streams at 720P.
  • 32 vCPUs.
  • 60 GiB of memory.
  • 240 GB (2 x 120) of SSD storage.

As noted in the comments, the 60 GiB is ordinary system RAM, used for regular (CPU) calculations. GPU memory is separate: a g2.8xlarge has 4 GPUs with 4 GB of video memory each, and that is the memory used for calculations in nvidia/cuda containers.

In your case, the pod's affinity rule requires more than 8000 (roughly 8 GB) of GPU memory per GPU, but each GPU on your server has only 4 GB. The scheduler therefore cannot find a node that satisfies the request, and the pod stays pending. Reduce the GPU memory requirement in the pod spec, or use a server with a larger amount of GPU memory.
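
For reference, here is a minimal sketch of the same pod spec from the question with the affinity requirement lowered so a g2.8xlarge can satisfy it. The value 3000 is an assumption, chosen only to stay below the roughly 4 GB of memory each GRID GPU exposes; you could also remove the affinity block entirely:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            # Assumption: values are in MiB, as in the guide's example; anything
            # below the ~4096 MiB per GRID GPU should allow the pod to schedule.
            values: ["3000"]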