I'm trying to set up Kubernetes with NVIDIA GPU nodes/slaves. I followed the guide at https://docs.nvidia.com/datacenter/kubernetes-install-guide/index.html and was able to get the node to join the cluster. I tried the kubeadm example pod below:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["8000"]
The pod fails to schedule, and kubectl get events shows:
4s 2m 14 gpu-pod.15487ec0ea0a1882 Pod Warning FailedScheduling default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 PodToleratesNodeTaints.
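From that event, one node is rejected because of a taint (most likely the master's NoSchedule taint) and the other because of insufficient nvidia.com/gpu, so it may help to compare what each node actually advertises against what the pod ends up requesting. A quick check could look roughly like this; the node name is a placeholder for whatever kubectl get nodes returns:

  # List nodes, then check the GPU node's taints
  kubectl get nodes
  kubectl describe node <gpu-node-name> | grep -A5 Taints

  # Check the GPU resources the device plugin advertises (Capacity vs Allocatable)
  kubectl describe node <gpu-node-name> | grep -A8 Capacity
  kubectl describe node <gpu-node-name> | grep -A8 Allocatable

  # See what the scheduler thinks the pod is requesting
  kubectl describe pod gpu-pod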
I'm using AWS EC2 instances: m5.large for the master node and g2.8xlarge for the slave node. Describing the slave node also shows "nvidia.com/gpu: 4". Can anybody tell me if I'm missing any steps or configuration?
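For what it's worth, a minimal pod that only requests a GPU through the standard device-plugin resource, without the extendedResources/gpu-memory affinity from the guide, would look roughly like the sketch below; I assume that if this schedules on the same node, the problem is the gpu-memory requirement rather than the device plugin itself (the pod name here is just for the test):

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-test-pod
  spec:
    containers:
      - name: cuda-container
        image: nvidia/cuda:9.0-base
        command: ["sleep", "100000"]
        resources:
          limits:
            nvidia.com/gpu: 1   # plain device-plugin resource request, no GPU attribute affinity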
Comments:
Please share the output of kubectl describe nodes, and also the results of the commands kubectl describe pods gpu-pod and kubectl logs gpu-pod. The information you provided is not enough to understand what is happening. – Artem Golenyaev
The nvidia.com/gpu-memory affinity requirement may not be satisfiable on your node; decrease the required memory or expand resources of the Nvidia GPU nodes. – Artem Golenyaev
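Following that suggestion, lowering the gpu-memory threshold in the affinity block would look roughly like the sketch below. The "3500" value is an assumption based on the GRID K520 GPUs in g2 instances exposing about 4 GB of memory per GPU, so verify the actual nvidia.com/gpu-memory value advertised by the node before relying on it:

  affinity:
    required:
      - key: "nvidia.com/gpu-memory"
        operator: "Gt"
        values: ["3500"]   # assumed threshold below the ~4 GB per GPU on g2 instances; check the node's advertised value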