I have some trouble understanding how shared memory is organized in CUDA, and three (possibly dumb) questions.
1) The CUDA C Programming Guide says: "There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors, both of which are used to cache accesses to local or global memory, including temporary register spills". Further on, it refers to the L2 cache as "shared memory".
Am I wrong in saying that the L2 cache shared by all SMs and the shared memory used to store variables declared with the __shared__ qualifier are two different things?
2) I have a GeForce 630M with Compute Capability 2.1, so the result of the following:
cudaDeviceProp info;
cudaGetDeviceProperties(&info, 0);
printf("%zu kB\n", info.sharedMemPerBlock / 1024);  // sharedMemPerBlock is a size_t
is 48 kB.
Does this mean that 48 kB is both the total size of shared memory and the maximum possible size of shared memory for one block? I mean, if I launch a kernel with N blocks, can each block use only 48/N kB?
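To make the question concrete, here is a hypothetical kernel (names and sizes are mine) where each block statically allocates a 16 kB __shared__ array:

```cuda
// Hypothetical kernel: each block statically declares 16 kB of shared memory.
__global__ void scale(float *data)
{
    __shared__ float tile[4096];   // 4096 * 4 B = 16 kB per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x] * 2.0f;
}
// If I launch scale<<<N, 256>>>(d_data), does every one of the N blocks
// get its own 16 kB, or do they all have to fit within the same 48 kB?
```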
3) Is there any difference between
extern __shared__ float shared[];
and
__shared__ float shared[];
?
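For context, this is how I have seen the two forms used (a sketch based on my reading of the launch syntax; kernel names are mine):

```cuda
// Static form: the array size is fixed at compile time inside the kernel.
__global__ void kStatic(float *out)
{
    __shared__ float buf[256];     // 256 floats, size known at compile time
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

// Dynamic form: the size is supplied as the third launch parameter.
__global__ void kDynamic(float *out)
{
    extern __shared__ float dynBuf[];  // size comes from the launch configuration
    dynBuf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = dynBuf[threadIdx.x];
}

// kStatic<<<1, 256>>>(d_out);
// kDynamic<<<1, 256, 256 * sizeof(float)>>>(d_out);  // size passed at launch
```

Is that the only difference, or is there something more going on?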