1
votes

I would like to improve my "data transfer" algorithm between my MPI CPU processes and a single GPU.

With NUMPROCS processes, each MPI node has a 1D array of Ntot/NUMPROCS float4 elements.

My algorithm is very simple:

1) The 1D arrays are gathered (MPI_Gather) into a big array (size Ntot) on the master node.

2) On the master node, the big array is sent to the GPU with cudaMemcpy, and the CUDA kernel is launched from the master node.
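
For reference, a minimal sketch of those two steps in C with MPI and the CUDA runtime (the function name, buffer names, and the commented-out kernel launch are placeholders, not code from the question):

    #include <stdlib.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Sketch of the current approach: gather on the master, then one big copy. */
    void gather_then_copy(float4 *local_buf, int n_local, int rank, int numprocs)
    {
        int     Ntot = n_local * numprocs;
        float4 *big = NULL, *d_big = NULL;

        if (rank == 0)
            big = (float4 *)malloc(Ntot * sizeof(float4));

        /* Step 1: gather every chunk into one big host array on the master. */
        MPI_Gather(local_buf, n_local * sizeof(float4), MPI_BYTE,
                   big,       n_local * sizeof(float4), MPI_BYTE,
                   0, MPI_COMM_WORLD);

        /* Step 2: the master copies the big array to the GPU and launches the kernel. */
        if (rank == 0) {
            cudaMalloc((void **)&d_big, Ntot * sizeof(float4));
            cudaMemcpy(d_big, big, Ntot * sizeof(float4), cudaMemcpyHostToDevice);
            /* my_kernel<<<blocks, threads>>>(d_big, Ntot); */
            free(big);
        }
    }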

Is it possible to avoid the first step? That is, could each MPI node send its array via cudaMemcpy so that the concatenation is done directly in GPU memory?

Are the processes participating in the MPI_Gather running on the same physical host as the GPU, or are they communicating over a network? – talonmies
Open MPI supports experimental direct GPU memory transfer for some operations. MPI_Gather is one of the supported operations. The code is in the trunk, so you should compile your own Open MPI from the SVN sources. See here and here. – Hristo Iliev
I use MPI_Gather on the 4 cores of my quad-core CPU. My GPU is a GTX 680. I use CUDA 4.2 and MPICH2 version 1.2.1p1. – SystmD
Open MPI seems like a good idea, but I can't find any MPI+CUDA tutorial... Maybe there is another solution using CUDA streams? – SystmD
In the trunk Open MPI it works by just passing the device pointer as the source data buffer argument to MPI_Gather. Open MPI detects that the pointer is a device pointer and uses CUDA functions behind the scenes. – Hristo Iliev
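
As an illustration of the comments above (not code from them), a CUDA-aware Open MPI build would let the gather be written with device pointers directly; the names below are placeholders:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Gather device buffers directly; requires Open MPI built with CUDA support. */
    void gather_on_device(float4 *d_local, int n_local, int rank, int numprocs)
    {
        float4 *d_big = NULL;

        if (rank == 0)
            cudaMalloc((void **)&d_big, (size_t)numprocs * n_local * sizeof(float4));

        /* Open MPI detects that both buffers are device pointers and stages the
           transfer through CUDA internally; no host-side staging copy is needed. */
        MPI_Gather(d_local, n_local * sizeof(float4), MPI_BYTE,
                   d_big,   n_local * sizeof(float4), MPI_BYTE,
                   0, MPI_COMM_WORLD);
    }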

1 Answer

1
votes

Since your MPI processes are running on the same physical host as the GPU, you can avoid the first step.

You can use the asynchronous function cudaMemcpyAsync() for your second step. It takes a stream parameter, which lets you overlap GPU computation with memory copies.

In each process, you can use cudaSetDevice(device_number) to select which GPU to use.

For details, see the CUDA manual.
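
The answer does not spell out how several processes can write into the same device allocation. One possible approach, not taken from the answer and assuming CUDA 4.1+ on Linux, is to share the master's buffer through CUDA IPC and let each rank copy its chunk to its own offset with cudaMemcpyAsync(); every name below is illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Sketch: all ranks run on the same host and target the same GPU. */
    void direct_copy(float4 *local_buf, size_t n_local, int rank, int numprocs)
    {
        cudaIpcMemHandle_t handle;
        float4            *d_big = NULL;
        cudaStream_t       stream;

        cudaSetDevice(0);                 /* all ranks use the same GPU */
        cudaStreamCreate(&stream);

        if (rank == 0) {
            /* The master allocates the full-size buffer and exports an IPC handle. */
            cudaMalloc((void **)&d_big, numprocs * n_local * sizeof(float4));
            cudaIpcGetMemHandle(&handle, d_big);
        }

        /* Every rank receives the handle and maps the master's allocation. */
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank != 0)
            cudaIpcOpenMemHandle((void **)&d_big, handle,
                                 cudaIpcMemLazyEnablePeerAccess);

        /* Each rank copies its chunk into its own offset; the host buffer would
           need to be page-locked for the copy to be truly asynchronous. */
        cudaMemcpyAsync(d_big + rank * n_local, local_buf,
                        n_local * sizeof(float4), cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        MPI_Barrier(MPI_COMM_WORLD);      /* all chunks are now on the GPU */
        if (rank == 0) {
            /* my_kernel<<<blocks, threads, 0, stream>>>(d_big, numprocs * n_local); */
        }

        if (rank != 0)
            cudaIpcCloseMemHandle(d_big);
        cudaStreamDestroy(stream);
    }

Each rank synchronizes its own stream before the barrier, so the master knows every chunk has arrived on the GPU before it launches the kernel.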