What is the memory requirement for a CUDA kernel execution?

Question

I am executing a 320*320 array multiplication using CUDA on a GPU. I have observed that a fixed amount of memory is used that is unaccounted for. For example, in a 640*640 array multiplication, with each element occupying 4 bytes and 3 such arrays in the code, approximately 5 MB of GPU memory should be consumed. But when I check with the nvidia-smi command, it shows 53 MB consumed. This 48 MB is unaccounted for. The same is true for 1200*1200 or any other size.

I have calculated the stats for the array multiplication program only. — user3300239
Is the unaccounted-for amount growing? What is the total memory used when you perform your calculation with the 320x320 matrix? — user2076694
It's 50 MB, whereas the total size of the arrays comes out to approx. 1.2 MB. I assume that this 48 MB must be used by the CUDA kernel for instruction and code storage. — user3300239
Just out of curiosity, could you run a program with an empty kernel and check the memory usage? Also check your GPU memory usage when nothing is running. If your GPU drives the display of your laptop or computer, the 48 MB might be used or reserved for the display. — user2076694
If you want to use nvidia-smi, simply add a CPU breakpoint before or after the launch, call getc or system("pause"), or add a timing loop to the code. — Greg Smith

Greg Smith · Accepted Answer · 2014-04-04T14:27:29

The CUDA driver maintains numerous device memory allocations, including but not limited to:

1. Local Memory
   - Size = (user specified lmem size per thread + driver specified syscall stack) * MultiprocessorCount * MaxThreadsPerMultiprocessor
   - Example: 15 SM GK110
     - 15 Multiprocessors
     - 2048 MaxThreadsPerMultiprocessor
     - 2048 bytes per thread (cudaLimitStackSize)
     - 512 bytes per thread for syscall stack
     - SIZE = 15 * 2048 * (2048 + 512) = 78,643,200 bytes
2. Printf FIFO
3. Malloc Heap
4. Constant Buffers
   - The driver allocates multiple constant buffers per stream. These are used to pass launch configuration and launch parameters, module constants, and constant variables. The PTX manual has additional information on constant buffers.
5. CUDA Dynamic Parallelism Buffers

The driver defers creation of these buffers until necessary. This often means that the memory allocation will be done in one of the API calls to launch a kernel.

Items 1, 2, and 3 can be controlled to some extent through cudaDeviceSetLimit.
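
A minimal host-side sketch of how those three limits might be inspected and lowered with the CUDA runtime API. It assumes a CUDA toolkit is installed and a device is present. The reduced sizes (512 bytes, 64 KiB, 1 MiB) are arbitrary values chosen for illustration, not recommendations, and the helpers `check` and `print_limits` are hypothetical names.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Print the limits that size the per-thread stack, printf FIFO, and malloc heap.
static void print_limits(const char* label) {
    size_t stack = 0, fifo = 0, heap = 0;
    check(cudaDeviceGetLimit(&stack, cudaLimitStackSize), "get stack size");
    check(cudaDeviceGetLimit(&fifo, cudaLimitPrintfFifoSize), "get printf FIFO size");
    check(cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize), "get malloc heap size");
    std::printf("%s: stack=%zu B/thread, printf FIFO=%zu B, malloc heap=%zu B\n",
                label, stack, fifo, heap);
}

int main() {
    check(cudaFree(0), "context creation");  // forces the runtime to create a context
    print_limits("defaults");

    // Illustrative values only; set them before launching any kernel.
    check(cudaDeviceSetLimit(cudaLimitStackSize, 512), "set stack size");
    check(cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 * 1024), "set printf FIFO size");
    check(cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024), "set malloc heap size");
    print_limits("reduced");

    // Free and total device memory as seen by the runtime (device-wide).
    size_t free_bytes = 0, total_bytes = 0;
    check(cudaMemGetInfo(&free_bytes, &total_bytes), "cudaMemGetInfo");
    std::printf("device memory: %zu B free of %zu B\n", free_bytes, total_bytes);
    return 0;
}
```

Because the driver defers these allocations, the free-memory figure printed here may not yet reflect them. Calling `cudaMemGetInfo` again after the first kernel launch is one way to see the difference. The driver may also round a requested limit, so the values read back need not match the values set.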

Item 4 grows linearly with the number of streams allocated and modules loaded. At a point that differs for each architecture, the driver will start aliasing stream constant buffers to limit the resource allocations.
