I have been doing some research on asynchronous CUDA operations, and read that there is a kernel execution ("compute") queue, and two memory copy queues, one for host to device (H2D) and one for device to host (D2H).
It is possible for operations to be running concurrently in each of these queues. If I understand correctly, there can be up to 16 kernels executing at once in the compute queue (32 on some more modern architectures).
However, there can be only one memory transfer in flight at a time in each of the D2H and H2D queues. If both queues are active concurrently, that is a total of two simultaneous memory transfers, one in each direction.
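To make the overlap I'm describing concrete, here is a minimal sketch of how I understand it would be expressed with CUDA streams. The kernel name and sizes are just placeholders; the key point is that the three asynchronous calls target different hardware queues and so can run at the same time:

```cuda
#include <cuda_runtime.h>

// Trivial kernel, used only to occupy the compute queue.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Async copies only actually overlap when the host buffers are
    // pinned (page-locked), so allocate with cudaMallocHost, not malloc.
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);
    cudaMallocHost(&h_b, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // All three calls below return to the host immediately.
    // The H2D copy (s0), the kernel (s0), and the D2H copy (s1)
    // occupy the H2D queue, the compute queue, and the D2H queue
    // respectively, so the copy in s1 can overlap the work in s0.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

    // The host CPU blocks only here, not at the calls above.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```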
Assuming I understood this all correctly, my question is which device "manages" the transfer of data?
Further reading indicates that the GPU has direct memory access (DMA) to host (CPU) memory (RAM). This would suggest that the CUDA device (the GPU) contains a processor which manages the memory transfer. Perhaps this "processor" is some kind of memory controller which resides inside the main GPU silicon and communicates with host memory directly via the PCI-e bus?
Is my understanding correct?
I was initially confused when I read that the GPU can execute CUDA kernels simultaneously while memory transfers occur, and that in addition to this, asynchronous CUDA operations are non-blocking with respect to the host CPU.
This confused me because I had initially assumed that the host CPU was responsible for moving data between host RAM and the GPU over the PCI-e bus.