
I'm using CUDA (in reality pyCUDA, if the difference matters) and performing some computations over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I declare two linear arrays of 20000 components each:

float test[20000];
float test2[20000];

With these arrays I perform simple calculations, for example filling them with constant values. The point is that the kernel executes normally and performs the computations correctly (you can verify this by filling an output array with a random component of test and copying that array from the device to the host).
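For reference, here is a minimal sketch of the kind of kernel I am describing (the identifiers and the fill values are invented for illustration; out is assumed to hold one element per thread):

__global__ void fill_kernel(float *out)
{
    float test[20000];
    float test2[20000];

    // fill the per-thread local arrays with constant values
    for (int i = 0; i < 20000; ++i) {
        test[i]  = 1.0f;
        test2[i] = 2.0f;
    }

    // write one component back so the arrays cannot be optimized away
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = test[tid % 20000] + test2[tid % 20000];
}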

The problem is that my NVIDIA card has only 2GB of memory, while the total amount of memory needed to allocate test and test2 for every thread would be 320*600*2*20000*4 bytes, i.e. roughly 30GB, which is far more than 2GB.

Where is this memory coming from, and how can CUDA perform the computation in every thread?

Thank you for your time

To give a complete answer to this question, it will be necessary to know which actual GPU you are running on. The allocation does not happen as you suppose, but is partly conditioned by the hardware specifics (e.g. the number of SMs) of your GPU device. You may also want to refer to this question/answer. – Robert Crovella
@RobertCrovella My card is an NVIDIA GeForce 650M (compute capability 3.0). – Dargor
You can also run "achieved occupancy" and "memory statistics" in the CUDA profiler to see what is actually happening inside the device. – Diligent Key Presser
@MadSorcerer I have tried to use the NVIDIA profiler, but pyCUDA does not work very well with it: when multiple kernel calls are performed, some of them return a non-zero SYS exit and the profiler crashes. This is a known pyCUDA issue. – Dargor
Well, my guess is that CUDA runs your grid in portions small enough to have enough local memory for all running threads at once. I have never used pyCUDA though, so I cannot tell how to check that. I would suggest re-writing the kernel in CUDA/C++ because the NSIGHT profiler works perfectly with it. – Diligent Key Presser

1 Answer


The actual sizing of the local/stack memory requirement is not done as you suppose (for the entire grid, all at once) but is based on a formula described by @njuffa here.

Basically, the local/stack memory requirement is sized based on the maximum instantaneous capacity of the device you are running on, rather than on the size of the grid.

Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of the two values below (a short query sketch follows the list):

  1. The maximum local memory size (512KB for cc2.x and higher)
  2. available GPU memory/(#of SMs)/(max threads per SM)
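
Both quantities can be queried at run time. The following is a minimal CUDA C sketch (not taken from the original post) that reads the relevant device properties and prints the resulting per-thread limit; note that the second limit should use the memory that is actually free at that moment, not the nominal 2GB:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);   // free/total device memory in bytes, right now

    // limit 1: architectural per-thread local memory maximum (512KB for cc 2.x and higher)
    size_t limit1 = 512 * 1024;

    // limit 2: available GPU memory / (# of SMs) / (max resident threads per SM)
    size_t limit2 = freeMem / prop.multiProcessorCount / prop.maxThreadsPerMultiProcessor;

    printf("SMs: %d, max threads/SM: %d, free memory: %zu MB\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           freeMem / (1024 * 1024));
    printf("per-thread stack limit ~ min(%zu, %zu) = %zu KB\n",
           limit1 / 1024, limit2 / 1024,
           (limit1 < limit2 ? limit1 : limit2) / 1024);
    return 0;
}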

For your first case:

float test[20000];
float test2[20000];

That total is 160KB per thread (2 × 20000 × 4 bytes), so we are under the maximum limit of 512KB per thread. What about the 2nd limit?

The GTX 650M has 2 cc 3.0 (Kepler) SMs (each Kepler SM has 192 cores). Therefore, if all the GPU memory were available, the second limit above gives:

2GB/2/2048 = 512KB

(Kepler allows at most 2048 resident threads per multiprocessor), so the two limits coincide in this case. But this assumes all of the GPU memory is available.

Since you're suggesting in the comments that this configuration fails:

float test[40000];
float test2[40000];

i.e. 320KB per thread, I would conclude that the GPU memory actually available at the point of this bulk allocation attempt is somewhere above (160/512)*100%, i.e. above 31%, but below (320/512)*100%, i.e. below 62.5%, of 2GB. In other words, the memory available at the time of this bulk allocation request for the stack frame is somewhere between roughly 640MB and 1.25GB.

You could try to see if this is the case by calling cudaMemGetInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. Dynamic (e.g. cudaMalloc) and/or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch will all reduce the available memory.
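
Here is a minimal CUDA C sketch of that check (my_kernel, the buffer and the launch configuration are placeholders standing in for your real code). In pyCUDA, it looks like pycuda.driver.mem_get_info() returns the same (free, total) pair, but treat that as an untested suggestion:

#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real one
__global__ void my_kernel(float *out) { out[threadIdx.x] = 0.0f; }

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, 256 * sizeof(float));

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);   // what the runtime sees right before the launch
    printf("free: %zu MB of %zu MB\n",
           freeMem / (1024 * 1024), totalMem / (1024 * 1024));

    my_kernel<<<1, 256>>>(d_out);          // the launch whose stack frame gets sized
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}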

This is all to explain some of the specifics. The general answer to your question is that the "magic" arises because the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be significantly less than what would be required for the whole grid. For your case that is on the order of 2 * 2048 * 160KB ≈ 640MB, rather than the ~30GB that the full 320*600 grid would need.