I have been doing some research on asynchronous CUDA operations, and read that there is a kernel execution ("compute") queue, and two memory copy queues, one for host to device (H2D) and one for device to host (D2H).
It is possible for operations to be running concurrently in each of these queues. If I understand correctly, there can be up to 16 kernels executing at once in the compute queue (32 on some more modern architectures).
However, there can be only one memory transfer in flight at a time in each of the D2H and H2D queues. If both queues are active concurrently, that is a total of two simultaneous memory transfers, one in each direction.
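To make the overlap I'm describing concrete, here is a minimal sketch of how I understand it would be expressed with CUDA streams. The kernel name and sizes are just placeholders; the key point is that the three asynchronous calls target different hardware queues and so can run at the same time:

```cuda
#include <cuda_runtime.h>

// Trivial kernel, used only to occupy the compute queue.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Async copies only actually overlap when the host buffers are
    // pinned (page-locked), so allocate with cudaMallocHost, not malloc.
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);
    cudaMallocHost(&h_b, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // All three calls below return to the host immediately.
    // The H2D copy (s0), the kernel (s0), and the D2H copy (s1)
    // occupy the H2D queue, the compute queue, and the D2H queue
    // respectively, so the copy in s1 can overlap the work in s0.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

    // The host CPU blocks only here, not at the calls above.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```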
Assuming I understood this all correctly, my question is which device "manages" the transfer of data?
Further reading indicates that the GPU has direct memory access (DMA) to host (CPU) memory (RAM). This would suggest that the CUDA device (the GPU) contains a processor which manages the memory transfer. Perhaps this "processor" is some kind of memory controller which resides inside the main GPU silicon and communicates with host memory directly via the PCI-e bus?
Is my understanding correct?
I was initially confused when I read that the GPU can execute CUDA kernels simultaneously while memory transfers occur, and that in addition to this, asynchronous CUDA operations are non-blocking with respect to the host CPU.
This confused me because I had initially assumed that the host CPU was responsible for moving data between host RAM and the GPU over the PCI-e bus.