I understand that cudaMemcpy will synchronize host and device, but what about cudaMalloc and cudaFree?
Basically, I want to overlap memory allocation/copies and kernel execution across multiple GPU devices. A simplified version of my code looks like this:
void wrapper_kernel(int igpu, float *&data)
{
    cudaSetDevice(igpu);
    cudaMalloc(...);       // allocate device buffer
    cudaMemcpyAsync(...);  // host -> device
    kernels<<<...>>>(...);
    cudaMemcpyAsync(...);  // device -> host
    // some host code
}
int main()
{
    const int NGPU = 3;
    static float *data[NGPU];
    for (int i = 0; i < NGPU; i++) wrapper_kernel(i, data[i]);
    cudaDeviceSynchronize();
    // some host code
}
However, the GPUs are running sequentially, and I can't figure out why.
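For reference, one common way to get the overlap described above is to do all the blocking allocations up front and then drive each device with its own stream; a minimal sketch (not the original code; buffer sizes, kernel body, and launch configuration are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void kernels(float *d, int n) { /* placeholder kernel body */ }

int main()
{
    const int NGPU = 3;
    const int N = 1 << 20;
    float *d_data[NGPU], *h_data[NGPU];
    cudaStream_t stream[NGPU];

    for (int i = 0; i < NGPU; i++) {   // blocking setup, done once
        cudaSetDevice(i);
        cudaMalloc(&d_data[i], N * sizeof(float));
        cudaMallocHost(&h_data[i], N * sizeof(float)); // pinned host memory:
                                                       // required for truly async copies
        cudaStreamCreate(&stream[i]);
    }

    for (int i = 0; i < NGPU; i++) {   // this loop issues work without blocking
        cudaSetDevice(i);
        cudaMemcpyAsync(d_data[i], h_data[i], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        kernels<<<256, 256, 0, stream[i]>>>(d_data[i], N);
        cudaMemcpyAsync(h_data[i], d_data[i], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < NGPU; i++) {   // sync every device, not just the current one
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }
    return 0;
}
```

Note that cudaDeviceSynchronize() only synchronizes the currently selected device, which is why the final loop sets each device before syncing.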
cudaMalloc and cudaFree are blocking and synchronize across all kernels executing on the current GPU. – Jared Hoberock

… malloc and free from inside a kernel. – Jared Hoberock
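The second comment refers to device-side allocation: malloc and free called from kernel code draw from the device heap and do not involve the host. A minimal sketch (the kernel name and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Each thread allocates scratch space from the device heap; this does not
// synchronize the host the way host-side cudaMalloc/cudaFree do.
__global__ void scratch_kernel(int n)
{
    float *tmp = (float *)malloc(n * sizeof(float));
    if (tmp == NULL) return;   // the device heap can be exhausted
    for (int i = 0; i < n; i++) tmp[i] = (float)i;
    free(tmp);
}

int main()
{
    // The device heap is small by default; enlarge it before launching if needed.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    scratch_kernel<<<1, 32>>>(256);
    cudaDeviceSynchronize();
    return 0;
}
```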