When is calling cudaDeviceSynchronize really needed?
As far as I understand from the CUDA documentation, kernel launches are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize calls, except for a single one before the time measurement. I found that I get the same result, but with a 7-12x speedup (depending on the matrix sizes).

So the question is: are there any reasons to use cudaDeviceSynchronize apart from time measurement?
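Here is a minimal sketch of how I measure the time; the kernel is a placeholder standing in for my real training kernels, and the only cudaDeviceSynchronize is the one before reading the timer:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Placeholder for the real training kernels
__global__ void trainStep(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.99f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int iter = 0; iter < 100; ++iter)
        trainStep<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // The single synchronization point, right before the measurement
    cudaDeviceSynchronize();

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```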
For example:

Is it needed before copying data from the GPU back to the host with cudaMemcpy?

If I do matrix multiplications like

C = A * B
D = C * F

should I put cudaDeviceSynchronize between them? From my experiment it seems that I don't need to.
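To make both cases concrete, this is roughly what I mean; matMul is a simplified stand-in for my real matrix-multiply kernel, and all the variable names are placeholders:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Simplified elementwise stand-in for my real matrix-multiply kernel
__global__ void matMul(const float *X, const float *Y, float *Z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Z[i] = X[i] * Y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *A, *B, *C, *D, *F;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
    cudaMalloc(&D, bytes); cudaMalloc(&F, bytes);

    int block = 256, grid = (n + block - 1) / block;

    // Both launches go to the default stream; does the second one
    // wait for the first automatically, or do I need a sync between them?
    matMul<<<grid, block>>>(A, B, C, n);   // C = A * B
    // cudaDeviceSynchronize();            // <-- needed here?
    matMul<<<grid, block>>>(C, F, D, n);   // D = C * F

    // Is a sync needed before this blocking copy back to the host?
    float *h_D = (float *)malloc(bytes);
    cudaMemcpy(h_D, D, bytes, cudaMemcpyDeviceToHost);

    free(h_D);
    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(D); cudaFree(F);
    return 0;
}
```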
Why does cudaDeviceSynchronize slow the program down so much?