Global Memory vs. Dynamic Global Memory Allocation in CUDA

Question

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.

For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.

My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?

Robert Crovella Robert Crovella · Accepted Answer · 2013-10-30T12:25:09

Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).

All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).

Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.

Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.

Global Memory vs. Dynamic Global Memory Allocation in CUDA

2 Answers