CUDA variables inside global kernel

Question

My questions are:

1) Did I understand correctly that when you declare a variable in a global kernel, each thread gets its own copy of that variable? That would let every thread store an intermediate result in it. Example: vector addition c = a + b:

__global__ void addKernel(int *c, const int *a, const int *b)
{
   int i = threadIdx.x;   // index of this thread within its block
   int p;                 // intermediate result
   p = a[i] + b[i];
   c[i] = p;
}

Here we declare an intermediate variable p, but in reality there are N copies of it, one for each thread.

2) Is it true that if I declare an array, N copies of this array will be created, one for each thread? And since everything inside the global kernel happens in GPU memory, do you need N times more GPU memory for any variable you declare, where N is the number of threads?

3) In my current program I have 35*48 = 1680 blocks, and each block contains 32*32 = 1024 threads. Does that mean any variable declared within a global kernel will cost me N = 1024*1680 = 1,720,320 times more memory than one declared outside the kernel?

4) To use shared memory, I need M times more memory for each variable than usual, where M is the number of blocks. Is that true?

I didn't downvote you, but you asked several questions at once. Stack Overflow usually expects questions to be clear and focused. — Jared Hoberock

Jez Jez · Accepted Answer · 2014-12-11T18:22:26

1) Yes. Each thread has its own private copy of every non-shared variable declared in the function. These usually live in GPU registers, though they can spill into local memory (per-thread storage that resides in device memory).
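
To make the "private copy" point concrete, here is a minimal sketch rather than anything taken from the question's program. The names privateCopyKernel and countMismatches are hypothetical, and it assumes a machine with a working CUDA device. Every thread writes a different value into its own p, and the host checks that no thread saw another thread's value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes into its own private copy of p, then publishes it.
__global__ void privateCopyKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i >= n) return;                             // guard the tail block
    int p;       // private to this thread, normally held in a register
    p = i * i;   // no other thread can read or overwrite this value
    out[i] = p;
}

// Hypothetical helper: runs the kernel and counts elements where
// out[i] != i * i. Returns -1 if a CUDA call fails.
int countMismatches(int n)
{
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return -1;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    privateCopyKernel<<<blocks, threads>>>(dOut, n);

    int *hOut = new int[n];
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    int bad = 0;
    if (err != cudaSuccess) {
        bad = -1;
    } else {
        for (int i = 0; i < n; ++i)
            if (hOut[i] != i * i) ++bad;
    }
    delete[] hOut;
    cudaFree(dOut);
    return bad;
}

int main()
{
    printf("mismatches: %d\n", countMismatches(1000));
    return 0;
}
```

If the copies of p were shared between threads, the values would collide; with one private copy per thread the mismatch count should come out as zero.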

2), 3) and 4) While it's true that you need many copies of that private memory, the GPU does not need enough private memory for every thread at once, because the hardware does not have to execute all threads simultaneously. For example, if you launch N threads, it may be that half are active at a given time and the other half won't start until resources are free to run them.

The more resources your threads use, the fewer of them the hardware can run simultaneously, but that doesn't limit how many you can ask it to run: any threads the GPU doesn't have resources for will run once some resources free up.
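
As a rough illustration using the launch configuration from the question (35*48 blocks of 32*32 threads), the sketch below compares the number of threads requested with an upper bound on how many the device can hold resident at once, and then counts how many threads actually ran. The helper names (launchedThreads, maxResidentThreads, runAndCount) are hypothetical, and the resident bound is simply multiProcessorCount times maxThreadsPerMultiProcessor from cudaDeviceProp, which ignores per-kernel register and shared memory limits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every launched thread bumps one global counter exactly once.
__global__ void countKernel(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Total number of threads requested by a launch configuration.
long long launchedThreads(dim3 grid, dim3 block)
{
    return (long long)grid.x * grid.y * grid.z * block.x * block.y * block.z;
}

// Upper bound on threads the device can hold resident at once,
// or -1 if the device properties cannot be read.
long long maxResidentThreads(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return (long long)prop.multiProcessorCount *
           prop.maxThreadsPerMultiProcessor;
}

// Hypothetical helper: launches countKernel and returns how many threads
// ran, or -1 if a CUDA call fails.
long long runAndCount(dim3 grid, dim3 block)
{
    unsigned int *dCounter = nullptr;
    if (cudaMalloc(&dCounter, sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    countKernel<<<grid, block>>>(dCounter);

    unsigned int hCounter = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&hCounter, dCounter, sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dCounter);
    return err == cudaSuccess ? (long long)hCounter : -1;
}

int main()
{
    dim3 grid(35, 48), block(32, 32);
    printf("threads launched:     %lld\n", launchedThreads(grid, block));
    printf("max resident threads: %lld\n", maxResidentThreads(0));
    printf("threads that ran:     %lld\n", runAndCount(grid, block));
    return 0;
}
```

The first line prints 1720320 for this configuration. The resident bound depends on the device, but it is typically far smaller than that, and the launch is still legal: blocks that do not fit simply wait their turn.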

This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it runs threads in parallel, and to do that it needs to keep a lot of threads resident at any given time. In a very general sense, the more resources you use per thread, the fewer threads will be active at a given moment, and the less parallelism the hardware can exploit.
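
One way to see this trade-off on your own device is to ask the runtime how many blocks of a kernel can be resident per multiprocessor. The sketch below is an illustration under assumptions, not a tuning recipe: lightKernel, heavyKernel and residentBlocksPerSM are hypothetical names, and the 32 KiB shared array is an arbitrary size chosen only to make one kernel resource-hungry. It uses cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaFuncGetAttributes from the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Light kernel: only a few private scalars per thread.
__global__ void lightKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float p = 2.0f * i;
    out[i] = p;
}

// Heavy kernel: every block also reserves 32 KiB of shared memory.
// Shared memory is allocated once per resident block, not once per thread.
__global__ void heavyKernel(float *out)
{
    __shared__ float tile[8192];   // 8192 floats = 32 KiB per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = 2.0f * i;
    __syncthreads();
    out[i] = tile[(threadIdx.x + 1) % blockDim.x];
}

// Hypothetical helper: how many blocks of `kernel` can be resident on one
// multiprocessor at the given block size, or -1 on error.
template <typename Kernel>
int residentBlocksPerSM(Kernel kernel, int blockSize)
{
    int numBlocks = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, kernel, blockSize, 0) != cudaSuccess)
        return -1;
    return numBlocks;
}

int main()
{
    const int blockSize = 256;
    cudaFuncAttributes attr;

    if (cudaFuncGetAttributes(&attr, lightKernel) == cudaSuccess)
        printf("light: %d regs/thread, %zu B shared/block, %zu B local/thread\n",
               attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);
    printf("light: %d resident blocks per SM\n",
           residentBlocksPerSM(lightKernel, blockSize));

    if (cudaFuncGetAttributes(&attr, heavyKernel) == cudaSuccess)
        printf("heavy: %d regs/thread, %zu B shared/block, %zu B local/thread\n",
               attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);
    printf("heavy: %d resident blocks per SM\n",
           residentBlocksPerSM(heavyKernel, blockSize));
    return 0;
}
```

The exact numbers depend on the GPU and compiler version, but the heavy kernel should report no more resident blocks per multiprocessor than the light one, which is the per-thread and per-block resource cost showing up as reduced parallelism.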
