5 votes

I just noticed that my CUDA kernel uses exactly twice the shared memory that 'theory' predicts. For example:

__global__ void foo()
{
    __shared__ double t;  // sizeof(double) == 8, so 8 bytes expected
    t = 1;
}

The ptxas output shows:
ptxas info : Function properties for _Z3foov, 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 16 bytes smem, 32 bytes cmem[0]

But a double is only 8 bytes, so I expected 8 bytes of smem, not 16.
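For reference, the number ptxas prints can also be cross-checked at run time with cudaFuncGetAttributes, which reports a kernel's statically allocated shared memory. A minimal sketch (the host-side main is added here only for illustration; for the kernel above it should agree with the ptxas figure):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    __shared__ double t;
    t = 1;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, foo);  // query static smem for foo
    printf("static smem: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}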

Another example:

__global__ void foo()
{
    __shared__ int t[1024];  // 1024 * sizeof(int) = 4096 bytes expected
    t[0] = 1;
}

ptxas info : Used 3 registers, 8192 bytes smem, 32 bytes cmem[0]

Could someone explain why?

Need more clarification. How many blocks have you initialized? – Fr34K
@Fr34K: How can run-time settings possibly have anything to do with the size of a static allocation made by the compiler and assembler? – talonmies
@talonmies: Thanks for the clarification. I missed that point. :) – Fr34K
Honestly, I've never used statically allocated shared memory in CUDA kernels, but this looks strange. Which CUDA version do you have? Is it for 64-bit? – user1545642
If I compile your first kernel with the 4.2 release of nvcc, I only get 8 bytes of static shared memory for both compute 1.3 and compute 2.0 targets (i.e. both the old Open64 and the new LLVM-based compilers). What platform and CUDA version are you using for this? – talonmies

1 Answer

1 vote

It seems the problem has gone away in the current CUDA compiler.

__shared__ int a[1024];  // 1024 * sizeof(int) = 4096 bytes expected

compiled with the command 'nvcc -m64 -Xptxas -v -ccbin /opt/gcc-4.6.3/bin/g++-4.6.3 shmem.cu' gives:

ptxas info    : Used 1 registers, 4112 bytes smem, 4 bytes cmem[1]

There is some shared memory overhead in this case (4112 - 4096 = 16 bytes), but the usage is no longer doubled.
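To check whether that remaining overhead is a fixed per-kernel cost rather than something proportional to the allocation, comparing two kernels at run time is a quick test. A small sketch (the kernel names and sizes are made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void k_small() { __shared__ int a[1024]; a[0] = 1; }
__global__ void k_big()   { __shared__ int b[4096]; b[0] = 1; }

int main()
{
    cudaFuncAttributes small_attr, big_attr;
    cudaFuncGetAttributes(&small_attr, k_small);
    cudaFuncGetAttributes(&big_attr, k_big);
    // If the overhead is a fixed cost, the difference should be exactly
    // (4096 - 1024) * sizeof(int) = 12288 bytes.
    printf("k_small: %zu bytes, k_big: %zu bytes\n",
           small_attr.sharedSizeBytes, big_attr.sharedSizeBytes);
    return 0;
}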