1
votes

I have a question about how shared variables work.

When I declare a shared variable in a kernel like this:

__shared__ int array1[N];

each active block gets its own instance of array1 in its shared memory, meaning each active block allocates N*sizeof(int) bytes of shared memory. And N*sizeof(int) must be at most 16KB on a GPU with compute capability 1.3.
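A minimal sketch of that declaration in context (the kernel name, N, and the output array are made up for illustration):

```cuda
#include <cuda_runtime.h>

#define N 256  // N * sizeof(int) = 1KB of shared memory per block here

// Hypothetical kernel: every block that runs gets its own private copy
// of array1 in that block's on-chip shared memory.
__global__ void fillShared(int *out)
{
    __shared__ int array1[N];

    int tid = threadIdx.x;
    if (tid < N)
        array1[tid] = tid;   // writes land only in this block's copy
    __syncthreads();         // wait for all threads of this block

    if (tid == 0)
        out[blockIdx.x] = array1[0];
}
```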

So, assuming the above is correct and using 2D threads and 2D blocks assigned at host like this:

dim3 block_size(22,22);
dim3 grid_size(25,25);

I would have 25x25 instances of array1, each of size N*sizeof(int), and at most 22x22 threads could access each block's shared memory. This was my original question and it was answered.

Q: When I assign a value to array1

array1[0]=1;

then does every active block instantly store that value in its own shared memory?
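To make the question concrete, here is a small sketch (hypothetical kernel) of what I mean; each block writes array1[0] itself, and I want to know whether one block's write could ever be seen by another block's copy:

```cuda
#include <cuda_runtime.h>

__global__ void writeFlag(int *out)
{
    __shared__ int array1[16];

    int bid = blockIdx.y * gridDim.x + blockIdx.x;

    // One thread per block writes a block-specific value. After
    // __syncthreads() it is visible to the other threads of the SAME
    // block only, never to threads of other blocks.
    if (threadIdx.x == 0 && threadIdx.y == 0)
        array1[0] = bid;
    __syncthreads();

    if (threadIdx.x == 0 && threadIdx.y == 0)
        out[bid] = array1[0];   // each block reads back its own value
}
```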

1
Your assumption is correct. But what is the question? — sgarizvi
Since I confirmed my assumption, I wanted to ask: when I use 2D threads and 2D blocks declared on the host like this — dim3 block_size(22,22); dim3 grid_size(25,25); — how many instances of array1 do I have, 25 or 25*25? And how many threads can access each one of them? Thank you — user1280671
There will be (25 x 25) blocks, therefore 625 separate instances of shared memory in total, one for each block. The shared memory of a block can be accessed only by the threads of that block, so (22 x 22) = 484 threads will be able to access each shared memory instance. — sgarizvi
@sgar91: there are never that many separate instances of a static shared memory allocation. There are only ever as many as there are concurrent blocks per SM times the number of SMs running the kernel. — talonmies
And N*sizeof(int) can be bigger than 16KB on Fermi and Kepler if the L1/shared split is adjusted. — Robert Crovella

1 Answer

3
votes

Each block will always allocate its own shared memory array. So, if you launch 25x25 blocks, you will ultimately create 25x25 arrays in shared memory.

It does not mean, however, that all those arrays will exist at the same time, because not all blocks are guaranteed to run concurrently. The number of active blocks depends on the actual GPU model the kernel runs on. The GPU scheduler will launch as many blocks as possible, and the remaining blocks will run after previous ones finish their work.

The maximum of N*sizeof(int) depends on the compute capability of your card and the L1-cache configuration. It can vary between 8KB, 16KB, 32KB and 48KB.
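On Fermi and Kepler, the L1/shared split mentioned in the comments can be requested per kernel with cudaFuncSetCacheConfig. A sketch, assuming a kernel named myKernel and the launch configuration from the question:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int *out)
{
    // ... would use a large __shared__ array here ...
}

int main(void)
{
    // Ask for the 48KB-shared / 16KB-L1 split for this kernel.
    // The runtime treats this as a preference, not a guarantee.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    myKernel<<<dim3(25, 25), dim3(22, 22)>>>(NULL);
    cudaDeviceSynchronize();
    return 0;
}
```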

To answer your last question: each shared array is visible to all threads belonging to the corresponding block. In your case, each shared array will be visible to the corresponding 22x22 threads.