I've read some CUDA documentation that refers to local memory (it is mostly the early documentation). The device properties report a local-mem size (per thread). What does 'local' memory mean, where does it live, and how do I access it? Is it __device__ memory?
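For context, here's the kind of code I have in mind: a minimal sketch (the kernel and variable names are mine) where I'd expect the per-thread array to land in whatever 'local' memory is, since a runtime-indexed array presumably can't live in registers:

```cuda
__global__ void scratchKernel(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int scratch[32];             // per-thread array: is this "local" memory?
    for (int i = 0; i < 32; ++i)
        scratch[i] = in[tid] + i;

    int idx = in[tid] & 31;      // runtime index, so not register-resident (I assume)
    out[tid] = scratch[idx];
}
```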
The device properties also report global, shared, & constant memory sizes.
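Here's how I query those (a minimal sketch using cudaGetDeviceProperties; I don't see a per-thread local-mem field in cudaDeviceProp, which is part of my confusion):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("global mem:       %zu bytes\n", prop.totalGlobalMem);
    printf("shared mem/block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("constant mem:     %zu bytes\n", prop.totalConstMem);
    // No obvious local-mem-per-thread field here; the local size I saw
    // may come from the older docs or from compiler (ptxas -v) output.
    return 0;
}
```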
Are these statements correct? (A sketch of how I picture these declarations follows the list.)
Global memory is __device__ memory. It has grid scope, and a lifetime of the grid (kernel).

Constant memory is __device__ __constant__ memory. It has grid scope & a lifetime of the grid (kernel).

Shared memory is __device__ __shared__ memory. It has single-block scope & a lifetime of that block (of threads).
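For concreteness, this is how I picture those three declarations (a sketch; the names are mine):

```cuda
__device__   int   g_counter;        // "global" memory: visible to all threads
__constant__ float c_coeffs[16];     // "constant" memory: read-only from kernels

__global__ void exampleKernel(void)  // assumes a launch with blockDim.x == 256
{
    __shared__ float tile[256];      // "shared" memory: one copy per block
    tile[threadIdx.x] = c_coeffs[threadIdx.x % 16];
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(&g_counter, 1);    // one increment per block
}
```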
I'm thinking shared memory is SM memory, i.e. memory that only a single SM has direct access to, and a rather limited resource. Isn't an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks (or not)? i.e. run block A threads until they stall, then run block B threads until they stall, then swap back to block A threads again.

OR does the SM run a set of threads for block A until they stall, then swap in another set of block A threads, and continue this swapping until block A is exhausted? Then, and only then, would work begin on block B.

I ask because of shared memory. If a single SM is swapping in work from two different blocks, how does the SM quickly swap the shared-memory chunks in and out? (I'm thinking the latter scenario is true, and there is no swapping of shared-memory space: block A runs to completion, then block B starts execution. Note: block A could be from a different kernel than block B.)
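Related to this, I found the occupancy API, which (if I read it right) reports how many blocks can be resident on one SM at the same time; if that number is greater than 1, I'd guess the SM partitions its shared memory between resident blocks rather than swapping it. A sketch (probeKernel is just a placeholder):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probeKernel(void) { }    // placeholder kernel

int main()
{
    int blocksPerSM = 0;
    // How many 256-thread blocks of probeKernel fit on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, probeKernel, 256, 0 /* dynamic shared mem */);

    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```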