10
votes

I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.

I am trying to run a kernel where my input data is more than the amount available on the GPU.

Can I cudaMallocHost more space than there is on the GPU? If not, and let's say I allocate 1/4 of the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?

I would essentially still have to copy from that 1/4-sized buffer into my full-size malloc'd buffer, and that's probably no faster than just using normal cudaMalloc, right?

Is this the correct, typical usage scenario for cudaMallocHost:

  1. allocate pinned host memory (let's call it "h_p")
  2. populate h_p with input data
  3. get a device pointer on the GPU for h_p
  4. run the kernel using that device pointer to modify the contents of the array
  5. use h_p like normal, which now has the modified contents

So no copy has to happen between steps 4 and 5, right?

If that is correct, then I can see the advantage, at least for kernels whose data fits on the GPU all at once.

5
you seem to be asking several questions... – jmilloy
@Derek In order to avoid copies when using non-pageable memory (also known as pinned memory) on the host with cudaHostAlloc(), you just have to use the flag cudaHostAllocMapped instead of cudaHostAllocDefault when allocating. That way you can access the host memory directly from within CUDA C kernels. This is known as zero-copy memory. Pinned memory is also a double-edged sword: the computer running the application needs to have available physical memory for every page-locked buffer, since these buffers can never be swapped out to disk, which means physical memory runs out faster. – BugShotGG

5 Answers

6
votes

Memory transfer is an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:

  • allocate pinned memory: this is page-locked host memory that the CUDA runtime can track. If host memory allocated this way is involved in cudaMemcpy as either source or destination, the CUDA runtime will be able to perform an optimized memory transfer.
  • allocate mapped memory: this is also page-locked memory that can be used in kernel code directly as it is mapped to CUDA address space. To do this you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before using any other CUDA function. The GPU memory size does not limit the size of mapped host memory.

I'm not sure about the performance of the latter technique. It could allow you to overlap computation and communication very nicely.

If you access the memory in blocks inside your kernel (i.e. you don't need the entire data set but only a section), you could use a multi-buffering scheme with asynchronous memory transfers (cudaMemcpyAsync) and multiple buffers on the GPU: compute on one buffer, transfer one buffer to the host, and transfer another buffer to the device, all at the same time.
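Something along these lines (untested sketch; `process` is a made-up kernel, the chunk size is arbitrary, the total size is assumed to be a multiple of the chunk size, and `h_data` must be pinned for the async copies to actually overlap):

```
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   // elements per chunk (arbitrary)
#define NBUF  2           // number of device buffers / streams

__global__ void process(float *d, int n)           // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void run(float *h_data, size_t total)               // h_data: pinned host buffer
{
    float *d_buf[NBUF];
    cudaStream_t stream[NBUF];
    for (int b = 0; b < NBUF; ++b) {
        cudaMalloc((void **)&d_buf[b], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    size_t nchunks = total / CHUNK;                  // assumes total % CHUNK == 0
    for (size_t c = 0; c < nchunks; ++c) {
        int b = c % NBUF;
        // work issued to different streams can overlap: while one chunk is
        // being processed, the next one is already being copied in
        cudaMemcpyAsync(d_buf[b], h_data + c * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream[b]);
        process<<<(CHUNK + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], CHUNK);
        cudaMemcpyAsync(h_data + c * CHUNK, d_buf[b],
                        CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < NBUF; ++b) {
        cudaFree(d_buf[b]);
        cudaStreamDestroy(stream[b]);
    }
}
```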

I believe your assertions about the usage scenario are correct when using a mapped (cudaHostAllocMapped) allocation. You do not have to do an explicit copy, but there certainly will be implicit transfers that you don't see. There's a chance they overlap nicely with your computation. Note that you need to synchronize after the kernel call to make sure the kernel has finished and that you have the modified contents in h_p.
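For reference, a minimal sketch of that scenario (error checking omitted; `modify` is a placeholder kernel):

```
#include <cuda_runtime.h>

__global__ void modify(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);            // must precede other CUDA calls

    float *h_p, *d_p;
    cudaHostAlloc((void **)&h_p, n * sizeof(float),
                  cudaHostAllocMapped);               // 1. pinned, mapped host memory
    for (int i = 0; i < n; ++i) h_p[i] = (float)i;    // 2. populate h_p
    cudaHostGetDevicePointer((void **)&d_p, h_p, 0);  // 3. device pointer for h_p
    modify<<<(n + 255) / 256, 256>>>(d_p, n);         // 4. kernel modifies the array
    cudaDeviceSynchronize();                          //    wait for the kernel
    // 5. h_p now holds the modified contents; no explicit cudaMemcpy was needed

    cudaFreeHost(h_p);
    return 0;
}
```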

1
votes

Accessing host memory from the device would be far slower than using on-device memory: it has both very high latency and very limited throughput. For example, PCIe x16 tops out at a mere 8 GB/s, while the device memory bandwidth of a GTX 460 is 108 GB/s.

1
votes

Neither the CUDA C Programming Guide nor the CUDA Best Practices Guide mentions that the amount allocated by cudaMallocHost can't be bigger than the device memory, so I conclude it's possible.

Data transfers from page-locked memory to the device are faster than normal data transfers, and even faster when using write-combined memory. Also, memory allocated this way can be mapped into the device address space, eliminating the need to (manually) copy the data at all. The transfer happens automatically as the data is needed, so you should be able to process more data than fits into device memory.
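For example, a write-combined pinned buffer could be allocated like this (sketch; note that write-combined memory is slow to read back from the host, so it only makes sense for buffers the host writes and the device reads):

```
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 * 1024 * 1024;
    float *h_in, *d_in;

    // pinned, write-combined host buffer: fast host-to-device transfers,
    // but slow to read on the host side
    cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocWriteCombined);
    cudaMalloc((void **)&d_in, bytes);

    // ... fill h_in here (sequential writes are fine; avoid reading it back) ...

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  // benefits from pinning + WC

    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}
```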

However, system performance (of the host) can suffer greatly if the page-locked amount makes up a significant part of host memory.

So when should you use this technique? Simple: if the data needs to be read only once and written only once, use it. It will yield a performance gain, since the data would have to be copied back and forth at some point anyway. But as soon as you need to store intermediate results that don't fit into registers or shared memory, process chunks of your data that fit into device memory with cudaMalloc instead.

0
votes
  1. Yes, you can cudaMallocHost more space than there is on the GPU.
  2. Pinned memory can have higher bandwidth, but can decrease host performance. It is very easy to switch between normal host memory, pinned memory, write-combined memory, and even mapped (zero-copy) memory. Why don't you use normal host memory first and compare the performance?
  3. Yes, your usage scenario should work.

Keep in mind that global device memory access is slow, and zero-copy host memory access is even slower. Whether zero-copy is right for you depends entirely on how you use the memory.
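A quick way to do that comparison is to time the same host-to-device copy from a pageable (malloc) and a pinned (cudaMallocHost) buffer (rough sketch using CUDA events; the buffer size is arbitrary and the numbers will depend on your system):

```
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// time a single host-to-device copy in milliseconds
static float time_h2d(float *h_src, float *d_dst, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 256 * 1024 * 1024;
    float *h_pageable = (float *)malloc(bytes);      // normal host memory
    float *h_pinned, *d_buf;
    cudaMallocHost((void **)&h_pinned, bytes);       // pinned host memory
    cudaMalloc((void **)&d_buf, bytes);

    printf("pageable: %.2f ms\n", time_h2d(h_pageable, d_buf, bytes));
    printf("pinned:   %.2f ms\n", time_h2d(h_pinned, d_buf, bytes));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```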

0
votes

Also consider using streams to overlap data transfers with kernel execution. This lets the GPU work on one chunk of data while other chunks are being transferred.