I declared dynamic shared memory and tried to trace it with Nsight 2.2 for Visual Studio 2010. I'm using CUDA 4.2 with a Quadro 5000.

In my kernel.cu:

extern __shared__ ushort2 sampleGatheringSM[];

In the function that calls the kernel:

sampleGathering_SM_size = dimBlock.x * dimBlock.y * 4 * sizeof(ushort2) * 2; // = 10240
sampleGatheringKernel<<<dimGrid, dimBlock, sampleGathering_SM_size>>>(dev_image, dev_gradient, width, height);

When I run an analysis activity in Nsight and open "CUDA Launches", it reports:

  • Allocated Registers per block: 10240
  • Allocated Shared Memory per block: 0
  • Block Limit Reason: Registers

Did I allocate shared memory correctly? I don't understand how I could have allocated registers instead.

EDIT:

It also reports:

  • Registers per thread: 32
  • Dynamic Shared memory per block: 0
  • Static shared memory per block: 0
The declaration of sampleGatheringSM[] appears to be correct (I can't verify it without the full code). The CUDA Launches table should show Registers per Thread (0-63), Static Shared Memory per Block = ? (insufficient detail in the description), and Dynamic Shared Memory per Block (10240), based on the 3rd argument in <<<>>>. In which UI do you see the term "Allocated Registers per block"? The only location I'm aware of is the CUDA Occupancy pane, which has the row Registers/Block. I will file a bug with the team to investigate the issue. - Greg Smith
Allocated Registers Per Block is a column on the far right. Sorry, I forgot the value appears twice. I've reproduced the problem locally. - Greg Smith
The Allocated Registers per Block is correct. dimBlock.x * dimBlock.y * 32 registers per thread is 10240. On an internal version of Nsight I can reproduce Dynamic Shared Memory = 0 for a CUDA trace activity. The correct value is reported for a CUDA profile activity. Static shared memory per block is correct for both activities. - Greg Smith
@GregSmith When you say "CUDA profile activity", do you mean that you used the Visual Profiler instead of Nsight? So the bug is in Nsight? As I don't use static shared memory, I couldn't test that. I don't know if it matters, but extern __shared__ ushort2 sampleGatheringSM[] is declared outside the __global__ function, at global scope in the file. - Seltymar
The CUDA Profile activity is an Analysis activity option found on the Analysis Activity page. - Jeff Davis
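
The two 10240 figures in this thread are different quantities that happen to coincide, which is probably the source of the confusion. A minimal host-only sketch of the arithmetic, assuming a 32 x 10 block (the posts only imply 320 threads per block; the exact shape and the helper names are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dynamic shared memory in bytes, using the formula from the question.
// sizeof(ushort2) is 4 bytes.
static size_t sharedBytesPerBlock(dim3 block) {
    return (size_t)block.x * block.y * 4 * sizeof(ushort2) * 2;
}

// Registers per block as threads * registers per thread,
// ignoring any hardware allocation granularity.
static unsigned registersPerBlock(dim3 block, unsigned regsPerThread) {
    return block.x * block.y * block.z * regsPerThread;
}

int main() {
    dim3 dimBlock(32, 10);  // assumed shape: 320 threads per block
    printf("shared bytes per block: %u\n", (unsigned)sharedBytesPerBlock(dimBlock));
    printf("registers per block:    %u\n", registersPerBlock(dimBlock, 32));
    return 0;
}
```

Both lines print 10240: 320 threads x 32 bytes of shared memory each, and 320 threads x 32 registers each. Any other block shape with 320 threads gives the same result.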

1 Answer

The declaration of dynamic shared memory is correct. The Nsight 2.2 Analysis Trace Report has a bug that occurs only for CUDA Trace activities. Trace activities run with the option Nsight | Options | Analysis | CUDA Kernel Trace Mode = Serialized, and CUDA Profile activities, both display the correct value. This bug will be fixed in the next version of Nsight.
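
Independently of what the tool reports, the allocation can be checked from the program itself. The following is a rough sketch, not the asker's code: demoKernel, runDemo, and the 32 x 10 block shape are hypothetical stand-ins. It launches with the same size formula, has every thread write and read back its own slice of the dynamic shared array, and queries the compiler-side figures with cudaFuncGetAttributes (which reports registers per thread and static shared memory only; the dynamic size exists only at launch).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Unsized extern array at file scope, as in the question.
extern __shared__ ushort2 sampleGatheringSM[];

// Hypothetical stand-in kernel: each thread fills its 8 slots, then reads one back.
__global__ void demoKernel(unsigned *out) {
    unsigned tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int i = 0; i < 8; ++i)
        sampleGatheringSM[tid * 8 + i] =
            make_ushort2((unsigned short)tid, (unsigned short)i);
    __syncthreads();
    out[tid] = sampleGatheringSM[tid * 8 + 7].y;  // each thread should read 7
}

// Returns true if the launch succeeded and every thread read back its value.
bool runDemo() {
    const unsigned threads = 320;
    dim3 dimGrid(1), dimBlock(32, 10);  // assumed shape, 320 threads
    size_t smBytes = dimBlock.x * dimBlock.y * 4 * sizeof(ushort2) * 2;

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, demoKernel) != cudaSuccess) return false;
    printf("registers per thread: %d, static shared bytes: %u\n",
           attr.numRegs, (unsigned)attr.sharedSizeBytes);

    unsigned *dev_out = 0;
    if (cudaMalloc(&dev_out, threads * sizeof(unsigned)) != cudaSuccess) return false;

    // Third launch argument = dynamic shared memory bytes per block.
    demoKernel<<<dimGrid, dimBlock, smBytes>>>(dev_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("launch: %s, dynamic shared bytes requested: %u\n",
           cudaGetErrorString(err), (unsigned)smBytes);

    unsigned host_out[threads];
    bool ok = (err == cudaSuccess) &&
              cudaMemcpy(host_out, dev_out, sizeof(host_out),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (unsigned i = 0; ok && i < threads; ++i) ok = (host_out[i] == 7);
    cudaFree(dev_out);
    return ok;
}

int main() { return runDemo() ? 0 : 1; }
```

If the third launch argument were too large for the device, the launch itself would return an error, so a clean run with correct read-back values is a reasonable sign that the dynamic allocation is in place even when the trace report shows 0.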