I understand that when a copy operation between host and device starts using cudaMemcpy, the host pointer is pinned automatically. Then what is the meaning and necessity of having a separate API, cudaHostAlloc(), for allocating pinned host memory?
3
votes
I think your understanding is wrong. Copying from unpinned memory involves an intermediate driver-managed buffer, which is why it is slower.
– talonmies
So you mean that from the host, data is first put in a driver-managed (pinned) buffer, and then transferred from there to the device?
– gpuguy
@gpuguy: yes. You can allocate your own staging buffer with cudaMallocHost() and use CUDA events to write your own pageable memcpy routine - see e.g. cudahandbook.to/GH9evr
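To illustrate the comment above, here is a rough sketch of such a staging-buffer copy routine. This is an assumption-laden simplification, not the Handbook's code: `STAGE_BYTES` is an arbitrary chunk size chosen for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <string.h>

#define STAGE_BYTES (1 << 20)  /* 1 MiB staging chunk (arbitrary choice) */

/* Copy pageable host memory to the device via a user-owned pinned buffer. */
void pageableMemcpyHtoD(void *dst, const void *src, size_t bytes,
                        cudaStream_t stream)
{
    char *stage;
    cudaMallocHost((void **)&stage, STAGE_BYTES);  /* pinned staging buffer */

    const char *s = (const char *)src;
    char *d = (char *)dst;
    while (bytes) {
        size_t chunk = bytes < STAGE_BYTES ? bytes : STAGE_BYTES;
        memcpy(stage, s, chunk);          /* CPU copy into the pinned buffer */
        cudaMemcpyAsync(d, stage, chunk, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);    /* wait before reusing the buffer */
        s += chunk;
        d += chunk;
        bytes -= chunk;
    }
    cudaFreeHost(stage);
}
```

A production version (as in the linked CUDA Handbook sample) would use two staging buffers and CUDA events so the CPU memcpy of one chunk overlaps the DMA of the previous one; the sketch above serializes each chunk for simplicity.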
– ArchaeaSoftware
@ArchaeaSoftware but how does the driver have access to the MMU for allocating a staging buffer on the host? Isn't this the job of the OS?
– gpuguy
Pinned memory has to be mapped for both the CPU and the GPU. Mapping it for the CPU gets done by the OS, but mapping it for the GPU gets done by the driver.
– ArchaeaSoftware
1 Answer
4
votes
The two operations are not the same, and the host pointer you pass to cudaMemcpy is not "pinned automatically".
For transfers from pageable host memory to the device, the host memory is first copied into a pinned staging buffer managed by the driver; the actual DMA transfer to the device is then performed from that staging buffer.
This extra copy typically makes pageable-memory transfers slower than transfers from pinned memory buffers. Pinned memory also enables other possibilities, such as mapped (zero-copy) memory that is directly accessible by the device, without an explicit (API-level) transfer.
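A minimal sketch contrasting the three cases discussed above. It assumes a device that supports host-mapped memory (most do); error checks and the actual kernel are omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    float *d;
    cudaMalloc(&d, bytes);

    /* 1. Pageable host memory: cudaMemcpy stages the data through a
       driver-managed pinned buffer before the DMA transfer. */
    float *pageable = (float *)malloc(bytes);
    cudaMemcpy(d, pageable, bytes, cudaMemcpyHostToDevice);

    /* 2. Pinned host memory: the DMA engine can read it directly, so the
       copy is typically faster, and cudaMemcpyAsync can genuinely overlap
       with kernel execution. */
    float *pinned;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);
    cudaMemcpy(d, pinned, bytes, cudaMemcpyHostToDevice);

    /* 3. Mapped pinned memory: the device dereferences the host allocation
       directly over the bus; no explicit cudaMemcpy at all. */
    float *mappedHost, *mappedDev;
    cudaHostAlloc((void **)&mappedHost, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&mappedDev, mappedHost, 0);
    /* a kernel could now read/write mappedDev directly */

    cudaFreeHost(mappedHost);
    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d);
    return 0;
}
```

Case 3 is why cudaHostAlloc() exists as a separate API: it takes flags (here `cudaHostAllocMapped`) that plain malloc-then-copy can never express.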