11 votes

I've added a GeForce GTX 1080 Ti to my machine (running Ubuntu 18.04 and Anaconda with Python 3.7) to utilize the GPU with PyTorch. Both cards are correctly identified:

$ lspci | grep VGA
03:00.0 VGA compatible controller: NVIDIA Corporation GF119 [NVS 310] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

The NVS 310 handles my 2-monitor setup; I only want to utilize the 1080 Ti for PyTorch. I also installed the latest NVIDIA drivers currently in the repository, and that seems fine:

$ nvidia-smi 
Sat Jan 19 12:42:18 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 310             Off  | 00000000:03:00.0 N/A |                  N/A |
| 30%   60C    P0    N/A /  N/A |    461MiB /   963MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   41C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Driver version 390.xx supports running CUDA 9.1 (9.1.85) according to the NVIDIA docs. Since this is also the version in the Ubuntu repositories, I simply installed the CUDA Toolkit with:

$ sudo apt-get install nvidia-cuda-toolkit

And again, this seems to be alright:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

and

$ apt-cache policy nvidia-cuda-toolkit
nvidia-cuda-toolkit:
  Installed: 9.1.85-3ubuntu1
  Candidate: 9.1.85-3ubuntu1
  Version table:
 *** 9.1.85-3ubuntu1 500
        500 http://sg.archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages
        100 /var/lib/dpkg/status

Lastly, I've installed PyTorch from scratch with conda:

conda install pytorch torchvision -c pytorch

Also no errors as far as I can tell:

$ conda list
...
pytorch                   1.0.0           py3.7_cuda9.0.176_cudnn7.4.1_1    pytorch
...

However, PyTorch doesn't seem to find CUDA:

$ python -c 'import torch; print(torch.cuda.is_available())'
False

In more detail, if I force PyTorch to convert a tensor x to CUDA with x.cuda(), I get the error:

Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://...
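
For reference, a minimal way to reproduce this (a sketch of what I'm running):

import torch

x = torch.randn(3, 3)
try:
    x = x.cuda()   # fails here when PyTorch cannot find a usable NVIDIA driver
except Exception as e:
    print(e)       # "Found no NVIDIA driver on your system. ..."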

What am I missing here? I'm new to this, but I think I've already searched the web quite a bit for caveats like mismatched NVIDIA driver and CUDA toolkit versions.

EDIT: Some more outputs from PyTorch:

print(torch.cuda.device_count())   # --> 0
print(torch.cuda.is_available())   # --> False
print(torch.version.cuda)          # --> 9.0.176
I would get rid of the NVS 310. And I would verify the CUDA install using the instructions in the Linux install guide provided by NVIDIA. Build and run a sample code like vectorAdd or bandwidthTest. If they don't work correctly, then your CUDA install is broken. – Robert Crovella
I've actually just read that the PyTorch binaries come bundled with the required CUDA and cuDNN stuff, so I removed the CUDA Toolkit just now. I actually took the 1080 from a machine with the same setup and an NVS 310, where it worked. I thought that it would help with some load balancing. – Christian
@RobertCrovella I've tested it with only the 1080 and it works with that. When using only this card I could also use a newer driver (415 instead of 390). I also tried only the NVS 310 with the 390 driver. I knew that the compute capability of this card is too low, but I remember the error saying so accordingly, rather than just saying that no driver was found. This time, however, it couldn't even find/see the driver. So, yeah, I will just leave the 1080 in there for now. Thanks! – Christian
Your PyTorch build is for CUDA 9.0.176 while you have CUDA 9.1.85 installed. However, remove the CUDA installation and let Anaconda install it for you (see the sketch below). – Luca Di Liello
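
For what it's worth, a quick sanity check after letting conda provide the CUDA runtime might look like this (a sketch; the expected outputs assume the reinstall worked):

import torch
print(torch.version.cuda)                 # the CUDA version the binaries were built with
if torch.cuda.is_available():             # True once driver and bundled runtime line up
    print(torch.cuda.get_device_name(0))  # expect the GeForce GTX 1080 Ti here
    x = torch.randn(3, 3, device='cuda')  # allocate a tensor directly on the GPU
    print(x.sum().item())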

2 Answers

0 votes

Since you have two graphics cards, selecting a card ID via CUDA_VISIBLE_DEVICES=GPU_ID should fix the problem, as per this explanation.
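
For example, a sketch that hides the NVS 310 and exposes only the 1080 Ti (device 1 in the question's nvidia-smi output); the variable must be set before CUDA is initialized, so do it before importing torch:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # assumed GPU_ID of the 1080 Ti; adjust for your setup

import torch
print(torch.cuda.is_available())   # the NVS 310 is no longer visible to PyTorch
print(torch.cuda.device_count())   # 1

Note that CUDA's default device enumeration (fastest first) can differ from nvidia-smi's PCI bus order, so verify the ID with torch.cuda.get_device_name() or set CUDA_DEVICE_ORDER=PCI_BUS_ID to make both orderings match.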

0 votes

I had the same issue when trying to use PyTorch to train on our server (which has 4 GPUs), so I didn't have the option of just removing the GPUs.

However, I am using Docker and docker-compose to run my training, and thus I found this PyTorch image from NVIDIA that comes with all the necessary setup. Before you pull the image, please make sure to check this page to determine which image tag is compatible with your NVIDIA driver version (if you pull the wrong one, it won't work).

Then, in your docker-compose file, you can specify which GPUs to use as follows:

version: '3.5'

services:
  training:
    build:
      context: ""
      dockerfile: Dockerfile
    container_name: training
    environment:
      - CUDA_VISIBLE_DEVICES=0,2
    ipc: "host"

Make sure to set ipc to "host", which will allow your Docker container to use the host's shared memory instead of the (insufficient) amount allocated by Docker.
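
Once the container is running, a quick check (a sketch) to confirm that only the requested GPUs are visible inside it:

import torch
print(torch.cuda.device_count())            # 2, matching CUDA_VISIBLE_DEVICES=0,2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))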