How is CUDA memory managed?

Question

When I run my CUDA program which allocates only a small amount of global memory (below 20 M), I got a "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation) I try to understand this problem, and realize I have a couple of questions related to CUDA memory management.

Is there a virtual memory concept in CUDA?
If only one kernel is allowed to run on CUDA simultaneously, after its termination, will all of the memory it used or allocated released? If not, when these memory got free released?
If more than one kernel are allowed to run on CUDA, how can they make sure the memory they use do not overlap?

Can anyone help me answer these questions? Thanks

Edit 1: operating system: x86_64 GNU/Linux CUDA version: 4.0 Device: Geforce 200, It is one of the GPUS attached to the machine, and I don't think it is a display device.

Edit 2: The following is what I got after doing some research. Feel free to correct me.

CUDA will create one context for each host thread. This context will keep information such as what portion of memory (pre allocated memory or dynamically allocated memory) has been reserved for this application so that other application can not write to it. When this application terminates (not kernel) , this portion of memory will be released.
CUDA memory is maintained by a link list. When an application needs to allocate memory, it will go through this link list to see if there is continuous memory chunk available for allocation. If it fails to find such a chunk, a "out of memory" error will report to the users even though the total available memory size is greater than the requested memory. And that is the problem related to memory fragmentation.
cuMemGetInfo will tell you how much memory is free, but not necessarily how much memory you can allocate in a maximum allocation due to memory fragmentation.
On Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory and WDDM will manage swapping data back to main memory.

New questions: 1. If the memory reserved in the context will be fully released after the application has been terminated, memory fragmentation should not exist. There must be some kind of data left in the memory. 2. Is there any way to restructure the GPU memory ?

Can you edit the question to include which operating system, GPU and cuda version you are using, and whether the GPU is a display or non display device. It will have a bearing on the correct answer to your question. — talonmies
To answer the extra questions - user observable fragmentation occurs within a context, and no there is no way to change memory mapping within the GPU, that is all handled by the host driver. — talonmies
As you explain, a context allocation is composed of context static allocation, context user allocation and CUDA context runtime heap. I think the size of context static allocation and context user allocation is pre-decided. Therefore, I think the only cause of memory fragmentation is context runtime heap which is only on Fermi architecture. Is that correct? I guess the system will pre-allocate a chunk of memory for context runtime heap so that the in-kernel dynamic memory allocation is enable. — xhe8
Your question is currently kind of a mess. can you edit it to just have initial backround, then a bunch of questions? — einpoklum

talonmies talonmies · Accepted Answer · 2011-12-31T05:05:35

The device memory available to your code at runtime is basically calculated as

Free memory =   total memory 
              - display driver reservations 
              - CUDA driver reservations
              - CUDA context static allocations (local memory, constant memory, device code)
              - CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
              - CUDA context user allocations (global memory, textures)

if you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory in the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocess on the device. This can run into hundreds of Mb of memory if a local memory heavy kernel is loaded on a device with a lot of multiprocessors.

The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run you problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call that will then give you the amount of memory your context is using. That might let you get a handle of where the memory is going. It is very unlikely that fragmentation is the problem if you are getting failure on the first cudaMalloc call.

How is CUDA memory managed?

2 Answers