I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is larger than the memory available on the GPU.
Can I cudaMallocHost more space than there is on the GPU? If not, and let's say I allocate 1/4 of the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?
I would essentially still have to copy from that 1/4-sized buffer into my full-size malloc'd buffer, and that's probably no faster than just using normal cudaMalloc, right?
Is this typical usage scenario correct for using cudaMallocHost:
1. allocate pinned host memory (let's call it "h_p")
2. populate h_p with input data
3. get device pointer on GPU for h_p
4. run kernel using that device pointer to modify contents of array
5. use h_p like normal, which now has modified contents
So - no copy has to happen between steps 4 and 5, right?
If that is correct, then I can see the advantage, at least for kernels whose data fits on the GPU all at once.
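If I understand the steps right, they would look something like the sketch below (the kernel, sizes, and variable names other than `h_p` are made up for illustration, and error checking is abbreviated):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel that doubles each element in place.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;

    // Zero-copy requires mapped pinned memory; on older toolkits this
    // flag must be set before any CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Step 1: allocate pinned, mapped host memory.
    float *h_p = nullptr;
    cudaHostAlloc(&h_p, n * sizeof(float), cudaHostAllocMapped);

    // Step 2: populate h_p with input data.
    for (int i = 0; i < n; ++i)
        h_p[i] = (float)i;

    // Step 3: get the device pointer that aliases h_p.
    float *d_p = nullptr;
    cudaHostGetDevicePointer(&d_p, h_p, 0);

    // Step 4: run the kernel with the device pointer; no cudaMemcpy needed.
    doubleElements<<<(n + 255) / 256, 256>>>(d_p, n);
    cudaDeviceSynchronize();  // make sure the kernel finishes before reading

    // Step 5: use h_p like normal; it now holds the modified contents.
    printf("h_p[10] = %f\n", h_p[10]);  // 10.0 doubled -> 20.0

    cudaFreeHost(h_p);
    return 0;
}
```

Note the `cudaDeviceSynchronize()` between steps 4 and 5: there is no copy, but you still have to wait for the kernel to finish before the host reads the results.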
Use cudaHostAlloc() with the flag cudaHostAllocMapped instead of cudaHostAllocDefault when allocating. That way you can access the host memory directly from within CUDA C kernels. This is known as zero-copy memory.
Pinned memory is also a double-edged sword: the computer running the application needs to have available physical memory for every page-locked buffer, since these buffers can never be swapped out to disk, so allocating a lot of it can cause the system to run out of memory faster. – BugShotGG