6 votes

I understand that cudaMemcpy will synchronize host and device, but how about cudaMalloc or cudaFree?

Basically, I want to overlap memory allocation/copies and kernel executions across multiple GPU devices. A simplified version of my code looks like this:

void wrapper_kernel(const int &ngpu, const float * const &data)
{
    cudaSetDevice(ngpu);
    cudaMalloc(...);
    cudaMemcpyAsync(...);
    kernels<<<...>>>(...);
    cudaMemcpyAsync(...);
    // some host code
}

int main()
{
    const int NGPU = 3;
    static float *data[NGPU];
    for (int i = 0; i < NGPU; i++) wrapper_kernel(i, data[i]);
    cudaDeviceSynchronize();
    // some host code
}

However, the GPUs run sequentially, and I can't figure out why.

Yes, cudaMalloc and cudaFree are blocking and synchronize across all kernels executing on the current GPU. – Jared Hoberock
@Jared Hoberock Thanks! So, is there a synchronized version of cudaMalloc or cudaFree, like cudaMemcpyAsync? – Hailiang Zhang
@Jared Hoberock I meant an asynchronous version in the above comment. – Hailiang Zhang
No, but you might be able to make your own by calling malloc and free from inside a kernel. – Jared Hoberock
@Jared Hoberock I have CUDA 4.0, and I doubt calling malloc and free from inside a kernel is supported. – Hailiang Zhang
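
For reference, device-side malloc and free were introduced in CUDA 3.2 and require compute capability 2.0 or later, so CUDA 4.0 does support them on Fermi-class hardware. A minimal sketch of what Jared suggests, with an illustrative kernel name and buffer size, could look like this:

// Illustrative sketch of device-side allocation (CUDA 3.2+, compute capability >= 2.0).
// The heap these allocations come from can be sized with
// cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes) before the first launch.
__global__ void scratch_kernel(int nbytes)
{
    char *scratch = (char *)malloc(nbytes);   // per-thread allocation from the device heap
    if (scratch == NULL) return;              // allocation can fail if the heap is exhausted

    for (int i = 0; i < nbytes; i++)
        scratch[i] = 0;                       // ...use the buffer...

    free(scratch);                            // must be freed from device code, not with cudaFree
}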

2 Answers

2 votes

Try using a cudaStream_t for each GPU. Below is an excerpt from simpleMultiGPU.cu, taken from the CUDA samples.

 //Solver config                                                          
TGPUplan      plan[MAX_GPU_COUNT];
//GPU reduction results                                                                                   
float     h_SumGPU[MAX_GPU_COUNT];

....memory init....

//Create streams for issuing GPU command asynchronously and allocate memory (GPU and System page-locked)
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamCreate(&plan[i].stream));
    //Allocate memory
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Data, plan[i].dataN * sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Sum, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Sum_from_device, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Data, plan[i].dataN * sizeof(float)));

    for (j = 0; j < plan[i].dataN; j++)
    {
        plan[i].h_Data[j] = (float)rand() / (float)RAND_MAX;
    }
}

....kernel, memory copyback....
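
The kernel-launch and copy-back section elided above follows the same per-stream pattern, roughly like this (quoted from memory of the same sample, so treat the kernel name reduceKernel and the BLOCK_N/THREAD_N launch configuration as illustrative):

//Copy data in, launch the kernel, and copy partial sums back, all on each GPU's own stream
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));

    //Copy input data from CPU (page-locked, so the copy really is asynchronous)
    checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data,
                                    plan[i].dataN * sizeof(float),
                                    cudaMemcpyHostToDevice, plan[i].stream));

    //Perform GPU computations on the same stream
    reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);

    //Read back the partial results asynchronously
    checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum,
                                    ACCUM_N * sizeof(float),
                                    cudaMemcpyDeviceToHost, plan[i].stream));
}

//Only now wait for all GPUs to finish
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamSynchronize(plan[i].stream));
}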

And here is a guide on using multiple GPUs.

1 vote

The most likely reason you are seeing GPU operations run sequentially is that while cudaMalloc is asynchronous, cudaFree is not: simply queue the buffers to be freed on the CPU thread and issue the free requests at the end of the series of operations.

You also need to use pinned memory for asynchronous transfers between CPU and GPU memory. CUDA provides two utility functions for this, cudaMallocHost and cudaFreeHost (used instead of malloc and free). BTW, there is also cudaHostAlloc for finer-grained control; read the CUDA manual for details.
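
To tie this back to the code in the question, here is a minimal sketch of what the multi-GPU loop could look like with pinned host buffers, one stream per device, and the frees deferred until the end. The kernel, the buffer size N, and all variable names are made up for illustration, and error checking is omitted.

#include <cuda_runtime.h>

// Illustrative kernel; stands in for the asker's real kernels.
__global__ void my_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

int main()
{
    const int NGPU = 3;
    const int N = 1 << 20;

    cudaStream_t streams[NGPU];
    float *h_data[NGPU];   // pinned host buffers
    float *d_data[NGPU];   // device buffers

    // Do all (blocking) allocations up front, before issuing any asynchronous work.
    for (int i = 0; i < NGPU; i++)
    {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMallocHost((void **)&h_data[i], N * sizeof(float));  // pinned: copies can be truly async
        cudaMalloc((void **)&d_data[i], N * sizeof(float));
    }

    // Queue copies and kernels on each device's own stream; nothing here blocks the host.
    for (int i = 0; i < NGPU; i++)
    {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_data[i], h_data[i], N * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        my_kernel<<<(N + 255) / 256, 256, 0, streams[i]>>>(d_data[i], N);
        cudaMemcpyAsync(h_data[i], d_data[i], N * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // Defer the (blocking) frees until all devices have finished their queued work.
    for (int i = 0; i < NGPU; i++)
    {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(d_data[i]);
        cudaFreeHost(h_data[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}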