I am observing strange behavior when overlapping data transfer and kernel execution in CUDA.
When I call cudaMemcpyAsync after cudaMemsetAsync, the cudaMemsetAsync does overlap with the compute kernel, but the cudaMemcpyAsync does not: the compute kernel runs to completion first, and only then is the cudaMemcpyAsync executed.
When I comment out the cudaMemsetAsync, the overlap happens correctly.
Part of the code is shown below, with some changes.
Code:
for (d = 0; d < TOTAL; ++d) {
    gpuErrchk(cudaMemsetAsync(data_d, 0, bytes, stream1));
    for (j = 0; j < M; ++j)
    {
        gpuErrchk(cudaMemcpyAsync(&data_d[index1], &data_h[index2], bytes, H2D, stream1));
    }
    gpuErrchk(cudaStreamSynchronize(stream1));
    cufftExecR2C(plan, data_d, data_fft_d);
    gpuErrchk(cudaStreamSynchronize(stream2));
    kernel<<<dimGrid, dimBlock, 0, stream3>>>(result_d, data_fft_d, size);
}
I am using an NVIDIA GTX Titan GPU, and the compute and memory operations are issued in different streams. Note also that cudaMemsetAsync and cudaMemcpyAsync operate on the same device buffer.
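For completeness, here is a minimal sketch of the allocation and stream setup I am assuming for the snippet above (the exact declarations are not in my original code excerpt; the types and sizes here are illustrative). The cudaHostAlloc call is the important part: cudaMemcpyAsync can only run asynchronously with respect to the host, and overlap with kernels, when the host buffer is page-locked.

```cuda
// Setup sketch (assumed, not shown in the snippet above).
// data_h must be page-locked (cudaHostAlloc, not malloc) for
// cudaMemcpyAsync to overlap with kernel execution.
float *data_h, *data_d;
cufftComplex *data_fft_d;
cudaStream_t stream1, stream2, stream3;

gpuErrchk(cudaHostAlloc((void **)&data_h, bytes, cudaHostAllocDefault)); // pinned host buffer
gpuErrchk(cudaMalloc((void **)&data_d, bytes));
gpuErrchk(cudaMalloc((void **)&data_fft_d, fft_bytes));

gpuErrchk(cudaStreamCreate(&stream1)); // copies / memset
gpuErrchk(cudaStreamCreate(&stream2)); // cuFFT (via cufftSetStream)
gpuErrchk(cudaStreamCreate(&stream3)); // compute kernel
```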