I'm running a simple test that compares the access latency of data allocated with malloc() and data allocated with cudaHostAlloc(), as seen from the host (the CPU performs all the accesses). I noticed that accessing data allocated with cudaHostAlloc() is much slower than accessing malloc'd data on the Jetson TK1.
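For reference, this is roughly the kind of comparison I mean (a minimal sketch, not my exact test code: the 64 MB buffer size, the 64-byte access stride, and the clock_gettime() timing are illustrative assumptions):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (64UL * 1024 * 1024)   /* 64 MB per buffer (arbitrary choice) */

/* Walk a buffer on the CPU, one read per 64-byte cache line, return elapsed seconds. */
static double touch_buffer(const char *buf, size_t n)
{
    struct timespec t0, t1;
    volatile unsigned long sum = 0;   /* volatile so the loop isn't optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i += 64)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    char *heap_buf = (char *)malloc(N);            /* ordinary heap allocation */
    memset(heap_buf, 1, N);

    char *pinned_buf = NULL;
    cudaHostAlloc((void **)&pinned_buf, N, cudaHostAllocDefault);  /* pinned host allocation */
    memset(pinned_buf, 1, N);

    printf("malloc'd buffer:      %f s\n", touch_buffer(heap_buf, N));
    printf("cudaHostAlloc buffer: %f s\n", touch_buffer(pinned_buf, N));

    cudaFreeHost(pinned_buf);
    free(heap_buf);
    return 0;
}

On the TK1, the second line is consistently much slower than the first, while on machines with discrete GPUs the two are comparable.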
This is not the case for discrete GPUs and seems specific to the TK1. After some investigation, I found that data allocated with cudaHostAlloc() is memory-mapped (mmap) into /dev/nvmap regions of the process address space, whereas normal malloc'd data lives on the process heap. I understand that this mapping may be necessary so the GPU can access the data, since cudaHostAlloc'd data has to be visible to both the host and the device.
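This is how the mapping can be observed from inside the process (a minimal sketch; the 1 MB allocation size is arbitrary and the exact /proc/self/maps output will differ per system):

#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    void *pinned = NULL;
    cudaHostAlloc(&pinned, 1 << 20, cudaHostAllocDefault);  /* 1 MB pinned buffer */

    /* Print the process mappings backed by /dev/nvmap. */
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps) {
        char line[512];
        while (fgets(line, sizeof line, maps))
            if (strstr(line, "nvmap"))
                fputs(line, stdout);
        fclose(maps);
    }

    cudaFreeHost(pinned);
    return 0;
}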
My question is the following: where does the overhead of accessing cudaHostAlloc'd data from the host come from? Is data mapped through /dev/nvmap left uncached by the CPU?