3
votes

I am trying to run MPI and CUDA code on a cluster. The code works fine on a single machine, but when I try to run it on the cluster I get this error:

error while loading shared libraries: libcudart.so.4: cannot open shared object file: No such file or directory

I checked my PATH and LD_LIBRARY_PATH and they look fine. My .bashrc file contains the following entries:

export PATH=$PATH:/usr/local/lib/:/usr/local/lib/openmpi:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib/openmpi/:/usr/local/cuda/lib

All the machines have the same installation of CUDA and OpenMPI.

I also have /usr/local/cuda/lib in /etc/ld.so.conf

Can anyone help me with this? This problem is really annoying.

Thanks.

1
What are you using to initialize the cluster? - rudolph9

1 Answer

5
votes

If you are submitting a batch job to a cluster, please add commands like

echo $LD_LIBRARY_PATH 
ldd ./your_app 

to your batch script. This should help to debug the problem.
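
The two commands above can be wrapped in a small helper so the job log shows, for each host, which libraries the loader fails to find. This is only a minimal sketch: the function name `report_missing_libs` is made up for illustration, and it assumes a Linux node where `ldd` is available.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): print the runtime
# environment and list any shared libraries the loader cannot resolve.
report_missing_libs() {
    app="$1"
    if [ ! -e "$app" ]; then
        echo "no such file: $app" >&2
        return 2
    fi
    echo "host: $(uname -n)"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
    # ldd marks each library it cannot locate with "not found"
    missing=$(ldd "$app" 2>/dev/null | grep "not found" || true)
    if [ -n "$missing" ]; then
        echo "unresolved libraries:"
        echo "$missing"
        return 1
    fi
    echo "no unresolved libraries reported"
    return 0
}
```

In a batch script you would call `report_missing_libs ./your_app` just before the line that launches the job. If libcudart.so.4 shows up under "unresolved libraries", the job's environment differs from the one in your interactive shell.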

Also make sure that mpirun exports the environment variables to the remote processes. For instance, with OpenMPI you would run your code with

mpirun -x LD_LIBRARY_PATH ...
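
To confirm that the variable actually reaches the other nodes, one option is to launch a trivial shell command through mpirun instead of the real application. The sketch below assumes OpenMPI's `mpirun` (its `-np` and `-x` options); the function name `show_node_env` and the default of 2 processes are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: print the library path seen by each launched process.
# Assumes OpenMPI's mpirun; the default process count of 2 is arbitrary.
show_node_env() {
    nprocs="${1:-2}"
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "mpirun not found on PATH" >&2
        return 127
    fi
    # -x forwards the launching shell's LD_LIBRARY_PATH to every process;
    # the single quotes delay expansion until the command runs on the node.
    mpirun -np "$nprocs" -x LD_LIBRARY_PATH \
        sh -c 'echo "$(uname -n): ${LD_LIBRARY_PATH:-<unset>}"'
}
```

Running `show_node_env 4` once as written and once with the `-x LD_LIBRARY_PATH` part removed should show whether the remote processes pick up /usr/local/cuda/lib on their own or only when it is exported explicitly.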