I would like to improve my data-transfer algorithm between MPI CPU nodes and a single GPU.
With NUMPROCS nodes, each MPI node has a 1D array of Ntot/NUMPROCS float4 elements.
My algo is very simple:
1) The 1D arrays are gathered (MPI_Gather) into a big array (size Ntot) on the master node.
2) On the master node, the big array is sent to the GPU via cudaMemcpy, and the CUDA kernel is launched from the master node. (A minimal sketch of this baseline follows the list.)
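For reference, here is a minimal sketch of that baseline. Everything concrete in it is an assumption for illustration: the total size Ntot = 1<<20, the placeholder kernel named `kernel`, and Ntot being divisible by the number of ranks. It would be compiled as a .cu file with nvcc and linked against MPI.

```c
// baseline.cu -- sketch of the current two-step approach.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void kernel(float4 *data, int n) {   // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i].x += 1.0f;               // placeholder work
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int Ntot = 1 << 20;              // assumed total element count
    const int nloc = Ntot / nprocs;        // assumes Ntot % nprocs == 0
    float4 *local = (float4 *)calloc(nloc, sizeof(float4)); // local chunk

    float4 *big = NULL;
    if (rank == 0) big = (float4 *)malloc((size_t)Ntot * sizeof(float4));

    // Step 1: concatenate every rank's chunk on the master node.
    // A float4 is four floats, hence the 4 * nloc MPI_FLOAT count.
    MPI_Gather(local, 4 * nloc, MPI_FLOAT,
               big,   4 * nloc, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        // Step 2: one big host-to-device copy, then launch the kernel.
        float4 *d_big;
        cudaMalloc((void **)&d_big, (size_t)Ntot * sizeof(float4));
        cudaMemcpy(d_big, big, (size_t)Ntot * sizeof(float4),
                   cudaMemcpyHostToDevice);
        kernel<<<(Ntot + 255) / 256, 256>>>(d_big, Ntot);
        cudaDeviceSynchronize();
        cudaFree(d_big);
        free(big);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
```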
Is it possible to avoid the first step? That is, can each MPI node send its array via cudaMemcpy so that the concatenation is done directly in the GPU's memory?
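One way to get exactly this, which works only if every rank runs on the same physical host as the GPU, is CUDA IPC (cudaIpcGetMemHandle / cudaIpcOpenMemHandle, available since CUDA 4.1 on Linux): rank 0 exports its device allocation, the other ranks map it and cudaMemcpy their chunk to their own offset. This is a sketch of that idea, not something from the question itself, and it reuses rank, nloc, Ntot, local, and kernel from the baseline above:

```c
/* Replaces the gather + memcpy of the baseline. All ranks must share
 * the host (and GPU) with rank 0. */
cudaIpcMemHandle_t handle;
float4 *d_big = NULL;

if (rank == 0) {
    cudaMalloc((void **)&d_big, (size_t)Ntot * sizeof(float4));
    cudaIpcGetMemHandle(&handle, d_big);        /* export the allocation */
}
/* Ship the opaque handle to every rank. */
MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank != 0)                                  /* map rank 0's buffer */
    cudaIpcOpenMemHandle((void **)&d_big, handle,
                         cudaIpcMemLazyEnablePeerAccess);

/* Each rank copies its chunk to its own offset: the concatenation
 * happens directly in GPU memory, with no host-side gather. */
cudaMemcpy(d_big + (size_t)rank * nloc, local,
           (size_t)nloc * sizeof(float4), cudaMemcpyHostToDevice);

MPI_Barrier(MPI_COMM_WORLD);                    /* all chunks in place */
if (rank != 0) cudaIpcCloseMemHandle(d_big);    /* unmap local view   */
else           kernel<<<(Ntot + 255) / 256, 256>>>(d_big, Ntot);
```

The barrier matters: the kernel on rank 0 must not launch until every rank's copy has completed.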
Comments:

- Are all the processes participating in the MPI_GATHER running on the same physical host as the GPU, or are they communicating over a network? – talonmies
- MPI_Gather is one of the supported operations. The code is in the trunk so you should compile your own Open MPI from the SVN sources. See here and here. – Hristo Iliev
- You can pass a device pointer straight to MPI_Gather. Open MPI detects that the pointer is a device pointer and uses CUDA functions behind the scenes. – Hristo Iliev
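Based on those comments, the CUDA-aware path might look like the following sketch. It assumes an Open MPI build with CUDA support and again reuses rank, nloc, Ntot, local, and kernel from the baseline; unlike the IPC variant, it also works when the ranks communicate over a network.

```c
/* CUDA-aware gather: the receive buffer on the root is a device
 * pointer. Assumes Open MPI built with CUDA support. */
float4 *d_big = NULL;
if (rank == 0)
    cudaMalloc((void **)&d_big, (size_t)Ntot * sizeof(float4));

/* Send buffers live in host memory; Open MPI detects that d_big is a
 * device pointer and stages the copies behind the scenes. */
MPI_Gather(local, 4 * nloc, MPI_FLOAT,
           d_big, 4 * nloc, MPI_FLOAT, 0, MPI_COMM_WORLD);

if (rank == 0) {
    kernel<<<(Ntot + 255) / 256, 256>>>(d_big, Ntot);
    cudaDeviceSynchronize();
}
```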