My questions are:
1) Did I understand correctly that when you declare a variable inside a __global__ kernel, each thread gets its own copy of that variable? That would let every thread store an intermediate result in it. Example: vector addition c = a + b:
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;   // index of this thread within its block
    int p;                 // intermediate variable declared inside the kernel
    p = a[i] + b[i];
    c[i] = p;
}
Here we declare an intermediate variable p. In reality, though, there are N copies of this variable, one per thread.
2) Is it true that if I declare an array inside the kernel, N copies of that array are created, one per thread? And since everything inside a __global__ kernel lives in GPU memory, does any variable declared there need N times more GPU memory, where N is the number of threads?
3) My current program launches 35*48 = 1680 blocks, and each block contains 32*32 = 1024 threads. Does that mean any variable declared inside a __global__ kernel costs me N = 1024*1680 = 1,720,320 times more memory than the same variable declared outside the kernel?
4) To use shared memory, do I need M times more memory for each variable than usual, where M is the number of blocks?
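
To make the arithmetic behind questions 3 and 4 concrete, here is a small host-side sketch of what I am assuming. It is plain C++ (it should also compile as host code inside a .cu file). The names kBlocks, kThreadsPerBlock, totalThreads, perThreadBytes and perBlockBytes are made up for illustration, and the formulas only encode my assumptions, which may well be wrong:

```cpp
#include <cstddef>

// Launch configuration from question 3.
constexpr std::size_t kBlocks = 35 * 48;           // 1680 blocks
constexpr std::size_t kThreadsPerBlock = 32 * 32;  // 1024 threads per block

// Total number of threads in the grid (the N of questions 1-3).
constexpr std::size_t totalThreads() {
    return kBlocks * kThreadsPerBlock;
}

// ASSUMPTION (questions 2 and 3): every thread holds its own copy of a
// kernel-local variable, and all copies occupy memory at the same time.
constexpr std::size_t perThreadBytes(std::size_t bytesPerCopy) {
    return totalThreads() * bytesPerCopy;
}

// ASSUMPTION (question 4): a shared variable has one copy per block,
// and all M = kBlocks copies occupy memory at the same time.
constexpr std::size_t perBlockBytes(std::size_t bytesPerCopy) {
    return kBlocks * bytesPerCopy;
}

// Compile-time check of the thread count quoted in question 3.
static_assert(totalThreads() == 1720320, "1680 blocks * 1024 threads");
```

Under these assumptions, a single 4-byte int such as p would take perThreadBytes(4) = 6,881,280 bytes in total, and a 4-byte shared variable would take perBlockBytes(4) = 6,720 bytes. Whether the GPU really allocates memory this way is exactly what I am asking.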