1
vote

Suppose we do an mmap() system call and map some PCIe device memory (such as GPU memory) into user space. The application can then access that memory region on the device without any OS overhead: data could be copied from the file-system buffer directly into device memory, with no intermediate copy.

The above statement must be wrong... Can anyone tell me where the flaw is? Thanks!

strace the X11 server (e.g. Xorg) to understand what it is doing (and how it is mmap-ing the GPU). – Basile Starynkevitch
What makes you think your statement is wrong? Try also cat /proc/$(pidof /usr/bin/X)/maps ... – Basile Starynkevitch
@BasileStarynkevitch I think it is flawed because I know that in CUDA, if you want to copy data to GPU memory, you first need to copy the data from disk to host memory, and then from host memory to device memory using cudaMemcpy. If the above statement were true, why would NVIDIA do those copies? They could just mmap() and copy directly from disk to GPU device memory. – fyang29
Well, CUDA is probably partly using Xorg, so some data has to flow from your application into Xorg (which feeds the GPU...) – Basile Starynkevitch
@artlessnoise Thanks. It makes sense to me that streamed data can't be memory-mapped. But what I want to compare mmap() with is cudaMemcpy(). The current CUDA programming model is that we first copy data from disk to system memory, and then from system memory to GPU memory. What design concerns led NVIDIA to choose the current implementation rather than an mmap() approach (mapping GPU memory into user space)? – fyang29

1 Answer

0
votes

For a normal device, what you have said is correct. But if the GPU memory behaves differently for reads and writes, the extra copy may be necessary. We should look at some documentation of cudaMemcpy().

From Nvidia's basics of CUDA, page 22:

direction specifies locations (host or device) of src and dst
Blocks CPU thread: returns after the copy is complete
Doesn't start copying until previous CUDA calls complete

It seems pretty clear that cudaMemcpy() is synchronized against prior GPU register writes, which may in turn have caused the mmap()-ed memory to be updated. Because the GPU is a pipeline, commands issued earlier may not have completed by the time cudaMemcpy() is issued from the CPU, so the copy has to wait for them before it can safely run.