I observe a strange behavior when overlapping data transfer and kernel execution in CUDA. When I call cudaMemcpyAsync after cudaMemsetAsync, the cudaMemsetAsync overlaps with the compute kernel, but the cudaMemcpyAsync does not: the compute kernel finishes and only then is the cudaMemcpyAsync executed. When I comment out the cudaMemsetAsync, the overlap works correctly. A slightly modified excerpt of the code is shown below.
Code:

for (d = 0; d < TOTAL; ++d)
{
    // Clear the device buffer in stream1.
    gpuErrchk(cudaMemsetAsync(data_d, 0, bytes, stream1));

    // Copy host data into the same device buffer, also in stream1
    // (indices and sizes are simplified in this excerpt).
    for (j = 0; j < M; ++j)
    {
        gpuErrchk(cudaMemcpyAsync(&data_d[index1], &data_h[index2], bytes, H2D, stream1));
    }
    gpuErrchk(cudaStreamSynchronize(stream1));

    // FFT, then wait for stream2 before launching the compute kernel in stream3.
    cufftExecR2C(plan, data_d, data_fft_d);
    gpuErrchk(cudaStreamSynchronize(stream2));
    kernel<<<dimGrid, dimBlock, 0, stream3>>>(result_d, data_fft_d, size);
}
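
Here gpuErrchk is the usual CUDA error-checking wrapper, essentially the common form below (my exact definition may differ slightly):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Aborts with the CUDA error string plus file/line on any failed call.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}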
I use an NVIDIA GTX-Titan GPU, and the compute and memory operations are issued in different streams. Moreover, cudaMemsetAsync and cudaMemcpyAsync operate on the same device buffer.
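
For completeness, the surrounding setup is roughly like the sketch below (sizes are placeholders and the details differ from my real code): the three streams are created normally, data_h is a pinned host buffer allocated with cudaMallocHost, which async copies need in order to overlap, and the cuFFT plan is attached to stream2 with cufftSetStream, which is why stream2 is synchronized before the kernel launch.

#include <cuda_runtime.h>
#include <cufft.h>

// Placeholder sizes; the real ones differ.
const int    size  = 1 << 20;
const size_t bytes = size * sizeof(float);

cudaStream_t stream1, stream2, stream3;
float        *data_h, *data_d, *result_d;
cufftComplex *data_fft_d;
cufftHandle  plan;

void setup()
{
    gpuErrchk(cudaStreamCreate(&stream1));
    gpuErrchk(cudaStreamCreate(&stream2));
    gpuErrchk(cudaStreamCreate(&stream3));

    // Pinned host buffer: required for truly asynchronous, overlapping copies.
    gpuErrchk(cudaMallocHost((void**)&data_h, bytes));
    gpuErrchk(cudaMalloc((void**)&data_d, bytes));
    gpuErrchk(cudaMalloc((void**)&data_fft_d, (size / 2 + 1) * sizeof(cufftComplex)));
    gpuErrchk(cudaMalloc((void**)&result_d, bytes));

    cufftPlan1d(&plan, size, CUFFT_R2C, 1);
    cufftSetStream(plan, stream2);   // FFT work is issued into stream2
}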