Context: CUDA 4.0, 64-bit Linux, NVIDIA UNIX x86_64 Kernel Module 270.41.19, on a GeForce GTX 480.
I am trying to find a (device) memory leak in my program. I use the runtime API and cudaMemGetInfo(&free, &total) to measure device memory usage, and I notice a significant loss (31 MB in this case) after kernel execution. The kernel code itself does not allocate any device memory, so I guess it is the kernel code that remains in device memory, even though I would have thought the kernel isn't that big. (Is there a way to determine the size of a kernel?)
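A minimal sketch of the measurement, assuming a trivial kernel. The names here (kernel, free_before, free_after) are illustrative, not from the original program; the initial cudaFree(0) forces context creation up front so the context's own allocation does not pollute the delta.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {}

int main() {
    size_t free_before, free_after, total;

    cudaFree(0);                          // force CUDA context creation now
    cudaMemGetInfo(&free_before, &total); // free device memory before launch

    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();              // wait until the launch has completed

    cudaMemGetInfo(&free_after, &total);  // free device memory after launch
    printf("delta: %zu bytes\n", free_before - free_after);
    return 0;
}
```

Whatever delta this prints includes everything loaded lazily on first launch (module code, launch machinery), not just the kernel's machine code itself.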
When is the kernel code loaded into device memory? I assume it happens when the host executes the line:
kernel<<<geom>>>(params);
Right? And does the code remain in device memory after the call? If so, can I explicitly unload the code?
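The runtime API loads kernel code implicitly and has no unload call, but the driver API makes module residency explicit. A sketch of that route, assuming a hypothetical precompiled module file kernel.cubin containing an entry point named kernel (error checking omitted for brevity):

```cuda
#include <cuda.h>

int main() {
    CUdevice  dev;
    CUcontext ctx;
    CUmodule  mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "kernel.cubin");   // kernel code goes to the device here
    cuModuleGetFunction(&fn, mod, "kernel");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, 0, NULL, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);                  // explicitly releases the module's device memory
    cuCtxDestroy(ctx);
    return 0;
}
```

With this approach you control exactly when each module's code occupies device memory, at the cost of managing modules and contexts yourself.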
What concerns me is device memory fragmentation. Think of a long sequence of alternating device memory allocations and kernel executions (of different kernels). After a while, device memory gets quite scarce. Even if you free some memory, the kernel code remains, leaving only the space between the kernels free for new allocations. This would result in massive memory fragmentation over time. Is this the way CUDA was designed?
Regarding the __global__ keyword: it just means anything that is declared at global scope in your .cu files, or any memory allocated via cudaMalloc. So, for example, if you have int8[1024] at global scope in a .cu file, or if you ever call cudaMalloc(8 * 1024) without a matching cudaFree, you will have an outstanding memory footprint (1 KB and 8 KB, respectively). - asm