I'm running a simple test that compares the access latency of data allocated with malloc() and data allocated with cudaHostAlloc(), as seen from the host (the CPU performs all the accesses). I noticed that accessing data allocated with cudaHostAlloc() is much slower than accessing malloc'd data on the Jetson TK1.
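For reference, this is roughly the kind of comparison I mean (a minimal sketch, not my exact test code: the 64 MB buffer size, the 64-byte access stride, and the clock_gettime() timing are illustrative assumptions):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (64UL * 1024 * 1024)   /* 64 MB per buffer (arbitrary choice) */

/* Walk a buffer on the CPU, one read per 64-byte cache line, return elapsed seconds. */
static double touch_buffer(const char *buf, size_t n)
{
    struct timespec t0, t1;
    volatile unsigned long sum = 0;   /* volatile so the loop isn't optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    char *heap_buf = (char *)malloc(N);            /* ordinary heap allocation */
    memset(heap_buf, 1, N);

    char *pinned_buf = NULL;
    cudaHostAlloc((void **)&pinned_buf, N, cudaHostAllocDefault);  /* pinned host allocation */
    memset(pinned_buf, 1, N);

    printf("malloc'd buffer:      %f s\n", touch_buffer(heap_buf, N));
    printf("cudaHostAlloc buffer: %f s\n", touch_buffer(pinned_buf, N));

    cudaFreeHost(pinned_buf);
    free(heap_buf);
    return 0;
}

On the TK1, the second line is consistently much slower than the first, while on machines with discrete GPUs the two are comparable.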
This is not the case for discrete GPUs and seems specific to the TK1. After some investigation, I found that data allocated with cudaHostAlloc() is memory-mapped (mmap) into /dev/nvmap regions of the process address space, whereas normal malloc'd data lives on the process heap. I understand that this mapping may be necessary so the GPU can access the data, since cudaHostAlloc'd data has to be visible to both the host and the device.
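This is how the mapping can be observed from inside the process (a minimal sketch; the 1 MB allocation size is arbitrary and the exact /proc/self/maps output will differ per system):

#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    void *pinned = NULL;
    cudaHostAlloc(&pinned, 1 << 20, cudaHostAllocDefault);  /* 1 MB pinned buffer */

    /* Print the process mappings backed by /dev/nvmap. */
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps) {
        char line[512];
        while (fgets(line, sizeof line, maps))
            if (strstr(line, "nvmap"))
                fputs(line, stdout);
        fclose(maps);
    }

    cudaFreeHost(pinned);
    return 0;
}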
My question is the following: where does the overhead of accessing cudaHostAlloc'd data from the host come from? Is data mapped through /dev/nvmap left uncached by the CPU?