Let me begin by pointing out that shared memory is, first and foremost, an abstraction of the programming model through which a certain feature of the hardware (fast, on-chip memory) is exposed. In the CUDA programming model, every block in a grid (kernel launch) gets the same amount of shared memory. How much that is depends on the amount of statically allocated shared memory required by the kernel function as well as any additional dynamic shared memory specified in the kernel launch.
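To make the static/dynamic distinction concrete, here is a minimal sketch (kernel names are illustrative, not from any particular codebase). The first kernel declares its shared memory statically, so the size is fixed at compile time; the second uses `extern __shared__`, and the size is supplied as the third launch-configuration parameter:

```cuda
#include <cstdio>

__global__ void staticSmemKernel() {
    __shared__ float tile[256];        // static: size known at compile time
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) printf("static: %f\n", tile[255]);
}

__global__ void dynamicSmemKernel() {
    extern __shared__ float buf[];     // dynamic: size set at launch time
    buf[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) printf("dynamic: %f\n", buf[255]);
}

int main() {
    staticSmemKernel<<<1, 256>>>();
    // third launch parameter = dynamic shared memory in bytes per block
    dynamicSmemKernel<<<1, 256, 256 * sizeof(float)>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Both kernels end up with the same per-block allocation here; the only difference is whether the size is baked into the kernel or chosen by the host at launch.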
- does every SM have the same amount of shared memory within the same GPU?
Yes, that is currently the case. However, this is not really as relevant for the way you program CUDA as you might think, because:
- How does an SM partition the shared memory amongst the blocks? Is it distributed equally (e.g. if there are 2 blocks, does each block get half the shared memory within the SM regardless of how much is actually used), or is it based on the needs?
When you launch a kernel, you specify how much shared memory each block needs. This then informs how many blocks can fit on each multiprocessor. So it's not that the number of blocks defines how much shared memory each block gets, but the other way around: the amount of shared memory needed per block is one of the factors that define how many blocks can reside on each multiprocessor.
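You can query this relationship directly with the occupancy API. The sketch below (kernel name is a placeholder) asks how many blocks of a given kernel can be resident per multiprocessor for a few different dynamic shared-memory sizes; the reported count drops as the per-block shared-memory requirement grows:

```cuda
#include <cstdio>

__global__ void myKernel() { /* placeholder kernel */ }

int main() {
    int blockSize = 256;
    for (size_t smemBytes : {size_t(0), size_t(16 * 1024), size_t(32 * 1024)}) {
        int numBlocks = 0;
        // How many blocks of myKernel fit per SM at this block size
        // and dynamic shared-memory request?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, myKernel, blockSize, smemBytes);
        printf("%zu bytes shared memory -> %d blocks per SM\n",
               smemBytes, numBlocks);
    }
    return 0;
}
```

The exact numbers depend on the GPU architecture, but the trend illustrates the point: shared memory per block is an input that determines residency, not the other way around.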
You will want to read up on latency hiding and occupancy as those are quite fundamental topics when it comes to GPU programming. For more details on the memory subsystems of different GPU architectures, have a look at the CUDA Programming Guide.
- My understanding of a shared memory bank is: shared memory is divided into 32 equally large memory banks. So does this mean per block (i.e. every block has its own 32 banks) or is it per SM?
In the end, due to the SIMD (SIMT) nature of GPU cores, the actual program execution happens in warps. When such a warp (currently, that effectively means a group of 32 threads) performs a shared memory access, bank conflicts will be an issue as the shared memory request generated by that instruction is served. It is not really documented whether shared memory requests for multiple warps can be served in parallel. My guess would be that there is only one unit to handle shared memory requests per SM and, thus, the answer is no.
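A small sketch of what a bank conflict looks like in practice (illustrative kernel, launched with e.g. 64 threads per block). On current hardware a 4-byte word at shared-memory offset `addr` maps to bank `(addr / 4) % 32`, so a stride-1 access pattern within a warp is conflict-free, while a stride-2 pattern makes threads `t` and `t + 16` of the same warp hit the same bank, and the request is replayed:

```cuda
__global__ void conflictDemo(float *out) {
    __shared__ float s[64];
    int t = threadIdx.x;
    s[t] = (float)t;
    __syncthreads();
    // Conflict-free: the 32 threads of a warp read 32 consecutive words,
    // which land in 32 different banks.
    float a = s[t];
    // 2-way bank conflict: with stride 2, threads t and t + 16 of the
    // same warp access words in the same bank.
    float b = s[(2 * t) % 64];
    out[t] = a + b;
}
```

Note that conflicts are only defined within one warp's request; accesses from different warps never conflict with each other in this sense.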
- If I perform a cudaMemcpy from/into shared memory of more than one word, does this count as a single transaction or multiple transactions? And could this cause bank conflicts?
You cannot cudaMemcpy() into shared memory. cudaMemcpy() moves data between host memory and device global memory (or between global memory regions). Shared memory is only accessible to device threads of the same block, and it only persists for as long as that block is running.