
I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA drivers and CUDA toolkit working.
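For context, the pod requests the GPU in the standard way with an nvidia.com/gpu resource limit; a minimal sketch of that part of the manifest (the pod name and structure here are illustrative, not my exact file) looks like:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: pytorch
    image: nvidia/cuda:10.0-runtime-ubuntu18.04   # placeholder; my real image adds PyTorch on top of this base
    resources:
      limits:
        nvidia.com/gpu: 1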

My NVIDIA drivers and CUDA installation are accessible inside my pod under /usr/local:

$> ls /usr/local
bin  cuda  cuda-10.0  etc  games  include  lib  man  nvidia  sbin  share  src

And my GPU is also recognized by my image nvidia/cuda:10.0-runtime-ubuntu18.04:

$> /usr/local/nvidia/bin/nvidia-smi
Fri Nov  8 16:24:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But after installing PyTorch 1.3.0, I'm not able to make PyTorch recognize my CUDA installation, even with LD_LIBRARY_PATH set to /usr/local/nvidia/lib64:/usr/local/cuda/lib64:

$> python3 -c "import torch; print(torch.cuda.is_available())"
False

$> python3
Python 3.6.8 (default, Oct  7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print ('\t\ttorch.cuda.current_device()    =', torch.cuda.current_device())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 386, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

The error above is strange, because the CUDA version in my image is 10.0 and the Google GKE documentation mentions that:

The latest supported CUDA version is 10.0
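As a quick sanity check (not part of the tutorial, just a diagnostic), the driver version the container actually sees and the CUDA version the installed PyTorch wheel was built against can be compared with:

$> /usr/local/nvidia/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader
$> python3 -c "import torch; print(torch.version.cuda)"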

Also, it is GKE's own DaemonSet that automatically installs the NVIDIA drivers:

After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers to the nodes.

Google provides a DaemonSet that automatically installs the drivers for you. Refer to the section below for installation instructions for Container-Optimized OS (COS) and Ubuntu nodes.

To deploy the installation DaemonSet, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
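To check that the driver installer actually ran and that the node advertises the GPU to the scheduler, something along these lines should work (assuming the DaemonSet keeps its default name nvidia-driver-installer in kube-system):

$> kubectl get daemonset nvidia-driver-installer -n kube-system
$> kubectl describe nodes | grep -i "nvidia.com/gpu"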

I have tried everything I could think of, without success...

Does the same container work locally (assuming you have NVIDIA hardware locally) using docker run on a local machine, or on a standalone GCE VM with a GPU? – Patrick W

1 Answer


I have resolved my problem by downgrading my PyTorch version, building my Docker image from pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.
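A minimal Dockerfile along those lines (the COPY/WORKDIR/CMD lines are placeholders for the actual application; the important part is the base image) is:

FROM pytorch/pytorch:1.2-cuda10.0-cudnn7-devel
# Placeholder application files; the key point is the base image,
# which ships PyTorch 1.2 built against CUDA 10.0 to match the driver on the node.
COPY . /app
WORKDIR /app
CMD ["python3", "-c", "import torch; print(torch.cuda.is_available())"]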

I still don't really know why it was not working before, other than guessing that PyTorch 1.3.0 is not compatible with CUDA 10.0.
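With that base image, a one-liner like the following should confirm both the CUDA build of the installed wheel and that the GPU is visible:

$> python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"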