
Consider the cuda code below:

cudaMemcpyAsync(H2D, data1, ..., StreamA);
KernelB<<<..., StreamB>>>(data1, ...);
cudaMemcpyAsync(D2H, output using data1, ..., StreamA);

When does the "cudaMemcpyAsync(D2H, ..., StreamA);" call in the code start? Does it start only after KernelB has finished executing? Should I replace "cudaMemcpyAsync(D2H, ..., StreamA);" with a blocking "cudaMemcpy(D2H, ...);" if I have to copy the output of KernelB back to the host?

Also, is pinned host memory absolutely required for asynchronous data transfers?

Thanks in advance.

1 Answer


User-created CUDA streams are asynchronous with respect to each other and with respect to the host. Tasks issued to the same CUDA stream are serialized, so in your case cudaMemcpyAsync(D2H, output using data1, ..., StreamA); will wait for the previous memory copy in StreamA to finish. But there is no guarantee that KernelB has finished executing when this copy starts, because StreamA and StreamB are asynchronous with respect to each other. If the copy depends on the kernel's output, you must order them explicitly: either issue the kernel and the copy into the same stream, or record a CUDA event in StreamB and make StreamA wait on it.
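One way to enforce that ordering across the two streams is with CUDA events. A minimal sketch (buffer names, sizes, and launch configuration are illustrative, since the original snippet elides them):

```cuda
cudaEvent_t h2dDone, kernelDone;
cudaEventCreate(&h2dDone);
cudaEventCreate(&kernelDone);

// H2D copy in StreamA, then mark its completion with an event.
cudaMemcpyAsync(d_data1, h_data1, nbytes, cudaMemcpyHostToDevice, StreamA);
cudaEventRecord(h2dDone, StreamA);

// KernelB in StreamB must not start before data1 has arrived on the device.
cudaStreamWaitEvent(StreamB, h2dDone, 0);
KernelB<<<grid, block, 0, StreamB>>>(d_data1, d_output);
cudaEventRecord(kernelDone, StreamB);

// The D2H copy in StreamA must not start before KernelB has finished.
cudaStreamWaitEvent(StreamA, kernelDone, 0);
cudaMemcpyAsync(h_output, d_output, nbytes, cudaMemcpyDeviceToHost, StreamA);
```

cudaStreamWaitEvent makes all work queued in the stream after the call wait for the event, without blocking the host, so the rest of your CPU code keeps running.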

Also, the host will not wait for these streams to finish execution.

If you want the host to wait for the streams, use cudaDeviceSynchronize (waits for all outstanding work on the device) or cudaStreamSynchronize (waits only for the work issued to one stream).
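For example, before reading the copied-back output on the CPU (variable names are illustrative):

```cuda
// Block the host until everything issued to StreamA (including the
// D2H copy) has completed, so h_output is safe to read on the CPU.
cudaStreamSynchronize(StreamA);
// ... use h_output on the host ...

// Or, more coarsely, wait for all work in all streams on the device:
cudaDeviceSynchronize();
```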

Pinned (page-locked) host memory is not strictly required to call cudaMemcpyAsync, but it is required for the transfer to actually be asynchronous: with ordinary pageable host memory the copy will not overlap with kernel execution, and the call may block the host.
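To get overlapping transfers, allocate the host buffers with cudaMallocHost (or cudaHostAlloc) instead of malloc/new. A sketch with illustrative names:

```cuda
// Pinned (page-locked) host allocations, so cudaMemcpyAsync can
// overlap with kernel execution and return immediately to the host.
float *h_data1, *h_output;
cudaMallocHost(&h_data1, nbytes);
cudaMallocHost(&h_output, nbytes);

cudaMemcpyAsync(d_data1, h_data1, nbytes, cudaMemcpyHostToDevice, StreamA);
// ... kernel launch, D2H copy, synchronization ...

// Pinned memory must be released with cudaFreeHost, not free/delete.
cudaFreeHost(h_data1);
cudaFreeHost(h_output);
```

Note that pinned memory is a limited resource; pinning very large buffers can degrade overall system performance, so pin only the staging buffers involved in async transfers.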