1 vote

I observe a strange behavior when overlapping data transfer and kernel execution in CUDA.

When I call cudaMemcpyAsync after cudaMemsetAsync, the cudaMemsetAsync does overlap with the compute kernel, but the cudaMemcpyAsync does not: the compute kernel runs to completion, and only then is the cudaMemcpyAsync executed. When I comment out the cudaMemsetAsync, the overlap happens as expected.

A simplified excerpt of the code is shown below.

Code:

 for (d = 0; d < TOTAL; ++d){
     // Zero the device buffer, then stage the host data into it, all in stream1
     gpuErrchk(cudaMemsetAsync(data_d, 0, bytes, stream1));
     for (j = 0; j < M; ++j)
     {
         gpuErrchk(cudaMemcpyAsync(&data_d[index1], &data_h[index2], bytes, H2D, stream1));
     }

     // Wait for the transfers in stream1 to finish before launching the FFT
     gpuErrchk(cudaStreamSynchronize(stream1));
     cufftExecR2C(plan, data_d, data_fft_d);

     gpuErrchk(cudaStreamSynchronize(stream2));
     kernel<<<dimGrid, dimBlock, 0, stream3>>>(result_d, data_fft_d, size);
 }

I use an NVIDIA GTX Titan GPU, and the compute and memory operations are issued in different streams. Moreover, cudaMemsetAsync and cudaMemcpyAsync operate on the same device buffer.

1
Can you post your code here? If so, please provide it as complete as possible so that the issue is reproducible by other people. – Farzad

1 Answer

2 votes

Some of CUDA's memcpy functions are implemented with kernels (such as device->device memcpy), but ALL of CUDA's memset functions are implemented internally as kernels.
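
To make that concrete, here is a minimal sketch of my own (not CUDA's actual implementation) of what an asynchronous memset effectively boils down to: a kernel launched into the requested stream, which therefore competes for SM resources like any other kernel:

    // Illustration only: an async memset behaves like a kernel launched
    // into the requested stream.
    __global__ void memset_kernel(unsigned char *p, unsigned char value, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            p[i] = value;
    }

    // Roughly what cudaMemsetAsync(data_d, 0, bytes, stream1) amounts to:
    // memset_kernel<<<(bytes + 255) / 256, 256, 0, stream1>>>(
    //     (unsigned char *)data_d, 0, bytes);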

Assuming the cufftExecR2C call is supposed to run in a different stream, you can bet that the kernel generated by the FFT plan was designed to fully occupy the GPU.
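
As an aside, cuFFT does let you bind a plan to a stream via cufftSetStream. A minimal sketch, reusing the names from the question (plan, stream2, data_d, data_fft_d) and assuming they are set up elsewhere:

    // Requires <cufft.h>. Bind the FFT plan to a stream so cufftExecR2C
    // is issued into that stream instead of the default (NULL) stream.
    cufftSetStream(plan, stream2);
    cufftExecR2C(plan, data_d, data_fft_d);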

So you are likely hitting the same limitation on kernel concurrency that you would hit if you tried to invoke your own kernel in another stream. Kernels must each occupy only a limited portion of the GPU in order to run concurrently, and most CUDA kernels are not designed to accommodate that use case.
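
To illustrate (a sketch of my own, with hypothetical names streamA, streamB, bufA, bufB): two kernels launched into different streams can only overlap when each leaves enough SMs and blocks free for the other, so tiny grids like the ones below may run concurrently, while a grid sized to fill the whole GPU will serialize with everything else:

    __global__ void small_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= 2.0f;
    }

    // Each launch uses only 2 blocks, leaving most SMs idle, so the two
    // launches *may* overlap. streamA/streamB and bufA/bufB are
    // hypothetical names, not from the original post.
    small_kernel<<<2, 128, 0, streamA>>>(bufA, 256);
    small_kernel<<<2, 128, 0, streamB>>>(bufB, 256);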