I declared shared memory and tried to trace it with Nsight 2.2 for Visual Studio 2010. I'm using CUDA 4.2 with a Quadro 5000.
In my kernel.cu:
extern __shared__ ushort2 sampleGatheringSM[];
In my function calling the kernel:
sampleGathering_SM_size = dimBlock.x*dimBlock.y*4*sizeof(ushort2)*2; // = 10240
sampleGatheringKernel<<<dimGrid, dimBlock, sampleGathering_SM_size >>>(dev_image, dev_gradient, width, height);
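For reference, here is a minimal sketch of the usual dynamic shared memory pattern, mirroring the declaration and launch above. The kernel body, its parameters, and the helper names `sampleGatheringSharedBytes`, `launchSampleGathering` and `printKernelResources` are hypothetical, not the code from the question. The size of an `extern __shared__` array comes only from the third `<<<...>>>` argument. `cudaFuncGetAttributes` reports registers per thread and *static* shared memory, so dynamically sized shared memory is not expected to show up there.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Dynamically sized shared array: its size in bytes is set by the third
// launch-configuration argument, not by this declaration.
extern __shared__ ushort2 sampleGatheringSM[];

// Hypothetical kernel body: copies through shared memory so the array is
// actually used (the real kernel in the question takes different arguments).
__global__ void sampleGatheringKernel(const ushort2* in, ushort2* out, int n)
{
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int threadsPerBlock = blockDim.x * blockDim.y;
    int gid = (blockIdx.y * gridDim.x + blockIdx.x) * threadsPerBlock + tid;
    if (gid < n) sampleGatheringSM[tid] = in[gid];
    __syncthreads();
    if (gid < n) out[gid] = sampleGatheringSM[tid];
}

// Same formula as in the question: 8 ushort2 elements (4 bytes each) per thread.
size_t sampleGatheringSharedBytes(dim3 block)
{
    return (size_t)block.x * block.y * 4 * sizeof(ushort2) * 2;
}

// Launches with the dynamic shared memory size as the third <<<>>> argument.
cudaError_t launchSampleGathering(const ushort2* d_in, ushort2* d_out, int n,
                                  dim3 grid, dim3 block)
{
    size_t smBytes = sampleGatheringSharedBytes(block);
    sampleGatheringKernel<<<grid, block, smBytes>>>(d_in, d_out, n);
    return cudaGetLastError();
}

// Prints what the compiler reserved for the kernel. Dynamic shared memory is
// not included in sharedSizeBytes; it is only known at launch time.
void printKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, sampleGatheringKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory per block: %lu bytes\n",
           (unsigned long)attr.sharedSizeBytes);
}
```

With a hypothetical 16×20 block, `sampleGatheringSharedBytes(dim3(16, 20))` gives 320 × 4 × 4 × 2 = 10240 bytes, matching the comment in the launch code.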
When I look at the activity analysis in Nsight, under "CUDA Launches", it tells me:
- Allocated Registers per block: 10240
- Allocated Shared Memory per block: 0
- Block Limit Reason: Registers
Did I allocate shared memory correctly? I don't understand how I could have allocated registers.
EDIT:
It also tells me:
- Registers per thread: 32
- Dynamic Shared memory per block: 0
- Static shared memory per block: 0
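The reported figures can be cross-checked with a little arithmetic. This is a sketch under two assumptions: `sizeof(ushort2)` is 4 bytes, and Nsight's "Allocated Registers per block" equals registers per thread times threads per block. From the size formula in the launch code:

$$
N_{\text{threads}} = \texttt{dimBlock.x}\cdot\texttt{dimBlock.y} = \frac{10240}{4 \cdot 4 \cdot 2} = 320,
\qquad
\text{registers per block} = 32 \cdot 320 = 10240 .
$$

Under those assumptions, the 10240 in "Allocated Registers per block" would be a register count that happens to equal the requested shared memory size in bytes, rather than the shared memory itself.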
extern __shared__ ushort2 sampleGatheringSM[] is declared at file scope in kernel.cu, outside the __global__ function. - Seltymar