I'm using MPI+CUDA mixed mode to program a GPU cluster for matrix multiplication. When I offload the multiplication operations to the GPUs via MPI and CUDA, it gives an error message at run time:

FATAL: Error inserting nvidia (/lib/modules/3.2.0-23-generic-pae/kernel/drivers/video/nvidia.ko): No such device

MPI is used to transfer the data blocks; upon receiving the data, a generic C function is called that launches a CUDA kernel. The test setup has 3 machines, each with a single GPU. I tested with a CUDA-only local version. I didn't get any error messages, but the results of the algorithms were wrong (even for small, simple algorithms).

What's the reason for this error? Please note that it occurs only when I use MPI with CUDA. The CUDA-only version works well. Thanks in advance.

Looks like the most common cause of the error is that the device is already controlled by the nouveau driver. But then it shouldn't be related to MPI... – Roger Dahl
MPI often implies accessing other machines in the cluster, besides the one on which the job was launched. If those other machines have configuration issues, then this problem or any of a number of other messages might occur. I think there's simply not enough to go on in this question to formulate any reasonable suggestions, but maybe someone else will have a suggestion. For example, it would be instructive to know the actual MPI launch command, the number of nodes being accessed, and whether this error message is originating locally or being reported back from MPI. – Robert Crovella
Also, what is the machine config (number of GPUs per node), and does the problem occur if only the local machine is specified in the MPI machine file? A vague question, in my opinion. – Robert Crovella
I'm sorry if it's too vague. I'm still testing, and I have very limited knowledge of mixed-mode programming. I have updated the question with the information you asked for. – Maddy
Can you run a non-CUDA MPI code successfully? A usual development sequence might be to get the code running properly without MPI first. Then add MPI, but launch a single rank on the local machine. After that is working, add remote machines. – Robert Crovella

1 Answer

The error occurs because the nouveau driver, not the NVIDIA driver, is controlling the GPU. Nouveau should therefore be blacklisted before the NVIDIA driver and the CUDA toolkit are installed.
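To confirm this diagnosis on a given node, one option is to look at the loaded kernel modules. The following is only a minimal sketch: the helper name gpu_driver_in_use is hypothetical, and it simply scans lsmod-style text for a module named nouveau or nvidia.

```shell
# Hypothetical helper: read lsmod-style text on stdin and print which GPU
# kernel module is present. If nouveau is loaded it is reported first,
# because that is the case that blocks nvidia.ko from binding to the device.
gpu_driver_in_use() {
    awk '$1 == "nouveau" { n = 1 }
         $1 == "nvidia"  { v = 1 }
         END {
             if (n)      print "nouveau"
             else if (v) print "nvidia"
             else        print "none"
         }'
}

# Run the check against the live module list when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | gpu_driver_in_use
fi
```

Since the MPI job touches all three machines, the same check would have to be run on each node, not only on the one where the job is launched.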

sudo nano /etc/modprobe.d/blacklist.conf

Add the following line at the end of the file:

blacklist nouveau

If the NVIDIA driver is already installed, reinstall it after blacklisting nouveau.
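The blacklisting step can also be scripted instead of edited by hand, which helps when it has to be repeated on several nodes. This is a sketch under stated assumptions: the function name blacklist_nouveau and the CONF variable are hypothetical, and CONF defaults to a scratch file so the snippet can be tried without touching the real configuration.

```shell
# Hypothetical helper: append "blacklist nouveau" to a modprobe config file,
# but only if that exact line is not already present (safe to run twice).
blacklist_nouveau() {
    conf="$1"
    touch "$conf"
    if ! grep -qx 'blacklist nouveau' "$conf"; then
        printf 'blacklist nouveau\n' >> "$conf"
    fi
}

# CONF is an assumption for illustration. On a real node it would be
# /etc/modprobe.d/blacklist.conf, and the script would need root privileges.
CONF="${CONF:-$(mktemp)}"
blacklist_nouveau "$CONF"
cat "$CONF"
```

On Ubuntu, the blacklist usually takes effect only after the initramfs is rebuilt with sudo update-initramfs -u and the machine is rebooted, since nouveau may otherwise still be loaded early in boot.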