Context: CUDA 4.0, 64-bit Linux, NVIDIA UNIX x86_64 Kernel Module 270.41.19, on a GeForce GTX 480.
I am trying to find a (device) memory leak in my program. I use the runtime API and cudaMemGetInfo(&free, &total) to measure device memory usage, and I notice a significant loss (31 MB in this case) after kernel execution. The kernel code itself does not allocate any device memory, so I guess it is the kernel code that remains in device memory, even though I would have thought the kernel isn't that big. (Is there a way to determine the size of a kernel?)
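A minimal sketch of the measurement, assuming a trivial kernel. The names here (kernel, free_before, free_after) are illustrative, not from the original program; the initial cudaFree(0) forces context creation up front so the context's own allocation does not pollute the delta.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() {}

int main() {
    size_t free_before, free_after, total;

    cudaFree(0);                          // force CUDA context creation now
    cudaMemGetInfo(&free_before, &total); // free device memory before launch

    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();              // wait until the launch has completed

    cudaMemGetInfo(&free_after, &total);  // free device memory after launch
    printf("delta: %zu bytes\n", free_before - free_after);
    return 0;
}
```

Whatever delta this prints includes everything loaded lazily on first launch (module code, launch machinery), not just the kernel's machine code itself.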
When is the kernel code loaded into device memory? I assume it happens when the host executes the line:
kernel<<<geom>>>(params);
Right? And does the code remain in device memory after the call? If so, can I explicitly unload the code?
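The runtime API loads kernel code implicitly and has no unload call, but the driver API makes module residency explicit. A sketch of that route, assuming a hypothetical precompiled module file kernel.cubin containing an entry point named kernel (error checking omitted for brevity):

```cuda
#include <cuda.h>

int main() {
    CUdevice  dev;
    CUcontext ctx;
    CUmodule  mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "kernel.cubin");   // kernel code goes to the device here
    cuModuleGetFunction(&fn, mod, "kernel");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, 0, NULL, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);                  // explicitly releases the module's device memory
    cuCtxDestroy(ctx);
    return 0;
}
```

With this approach you control exactly when each module's code occupies device memory, at the cost of managing modules and contexts yourself.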
What concerns me is device memory fragmentation. Think of a long sequence of alternating device memory allocations and kernel executions (of different kernels). After a while, device memory gets quite scarce. Even if you free some memory, the kernel code remains, leaving only the space between the kernels free for new allocations. This would result in massive memory fragmentation over time. Is this the way CUDA was designed?
Regarding the __global__ keyword: it just means anything that is declared at global scope in your .cu files, or any memory allocated via cudaMalloc. So, for example, if you have int8[1024] at global scope in a .cu file, or if you ever call cudaMalloc(8 * 1024) without a matching cudaFree, you will have an outstanding memory footprint (1 KB and 8 KB, respectively). - asm