87
votes

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On linux, the underlying system call for achieving this is mlock. From the man page of mlock, it states that locking the page prevents it from being swapped out:

mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;

In my tests, I had a fews gigs of free memory on my system so there was never any risk that the memory pages could've been swapped out yet I still observed the speedup. Can anyone explain what's really going on here?, any insight or info is much appreciated.

3
Did you measure time of mlock itself?osgx
No, the actual time taken to execute the mlock call is presumed to be negligible(if that's what you're asking). The real overhead is the actual data transfer, which in my algorithm is a significant fraction of total cycle time.Gearoid Murphy
What is your CPU? Maybe, NUMA-enabled nodes will not benefit from simple mlock().osgx
AMD Phenom(tm) II X4 970 ProcessorGearoid Murphy

3 Answers

90
votes

CUDA Driver checks, if the memory range is locked or not and then it will use a different codepath. Locked memory is stored in the physical memory (RAM), so device can fetch it w/o help from CPU (DMA, aka Async copy; device only need list of physical pages). Not-locked memory can generate a page fault on access, and it is stored not only in memory (e.g. it can be in swap), so driver need to access every page of non-locked memory, copy it into pinned buffer and pass it to DMA (Syncronious, page-by-page copy).

As described here http://forums.nvidia.com/index.php?showtopic=164661

host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

I can also recommend to check cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. HostAlloc says that cuda driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

18
votes

CUDA use DMA to transfer pinned memory to GPU. Pageable host memory cannot be used with DMA because they may reside on the disk. If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to GPU through DMA. So using the pinned memory you save the time to copy from pageable host memory to page-locked host memory.

5
votes

If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.