Consider the CUDA code below:
cudaMemcpyAsync(H2D, data1, ..., StreamA);
KernelB<<<..., StreamB>>>(data1, ...);
cudaMemcpyAsync(D2H, output using data1, ..., StreamA);
When does the "cudaMemcpyAsync(D2H, ..., StreamA);" call in this code start executing? Does it start only after KernelB has finished? If I need to copy the output of KernelB back to the host, should I replace it with the synchronous "cudaMemcpy(D2H, ...);"?
Also, is pinned memory absolutely required for asynchronous data transfers?
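To make the question concrete, here is a minimal sketch of the pattern as I currently understand it. The kernel body, the buffer names (h_in, h_out, d_in, d_out), the sizes and the two events are placeholders made up for illustration, not my real code. My working assumption is that operations issued to two different non-default streams are not ordered with respect to each other, so the sketch uses cudaEventRecord and cudaStreamWaitEvent to express both dependencies (H2D copy before KernelB, KernelB before the D2H copy). The host buffers are pinned with cudaMallocHost, again as an assumption about what the async copies need.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return -1 from the enclosing function on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in for KernelB: writes 2 * data1[i] into out[i].
__global__ void KernelB(const float* data1, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * data1[i];
}

// Returns the number of mismatched elements, or -1 on a CUDA error.
int run_pipeline() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    cudaStream_t streamA, streamB;
    cudaEvent_t h2dDone, kernelDone;

    // Pinned (page-locked) host buffers; an assumption of this sketch.
    CHECK(cudaMallocHost((void**)&h_in, bytes));
    CHECK(cudaMallocHost((void**)&h_out, bytes));
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));
    CHECK(cudaStreamCreate(&streamA));
    CHECK(cudaStreamCreate(&streamB));
    CHECK(cudaEventCreateWithFlags(&h2dDone, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&kernelDone, cudaEventDisableTiming));

    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    // Host-to-device copy in streamA, then mark its completion with an event.
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, streamA));
    CHECK(cudaEventRecord(h2dDone, streamA));

    // streamB waits for the copy before running the kernel.
    CHECK(cudaStreamWaitEvent(streamB, h2dDone, 0));
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    KernelB<<<blocks, threads, 0, streamB>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(kernelDone, streamB));

    // streamA waits for the kernel before the device-to-host copy.
    CHECK(cudaStreamWaitEvent(streamA, kernelDone, 0));
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, streamA));

    // Block the host until everything queued in streamA has finished.
    CHECK(cudaStreamSynchronize(streamA));

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;

    cudaEventDestroy(h2dDone);
    cudaEventDestroy(kernelDone);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return mismatches;
}

int main() {
    int result = run_pipeline();
    std::printf("mismatches: %d\n", result);
    return result == 0 ? 0 : 1;
}
```

If the two event waits are removed, I would expect nothing to stop the D2H copy in StreamA from being scheduled before KernelB in StreamB has finished, which is exactly the case I am asking about.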
Thanks in advance.