I just noticed that my CUDA kernel uses exactly twice the shared memory that 'theory' predicts. For example:
__global__ void foo()
{
__shared__ double t;
t = 1;
}
The ptxas -v output shows:
ptxas info : Function properties for _Z3foov, 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 16 bytes smem, 32 bytes cmem[0]
But the size of a double is only 8 bytes.
Another example:
__global__ void foo()
{
__shared__ int t[1024];
t[0] = 1;
}
ptxas info : Used 3 registers, 8192 bytes smem, 32 bytes cmem[0]
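In case it helps, here is a minimal host-side check I can use to confirm the number at runtime, independent of the compiler's verbose output. It queries the static shared memory size via cudaFuncGetAttributes (the printed value on my setup matches the doubled figure from ptxas; requires a CUDA-capable device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    __shared__ int t[1024];
    t[0] = 1;
}

int main()
{
    // Ask the CUDA runtime how much static shared memory foo uses.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, foo);
    // Expected from sizeof(int) * 1024: 4096 bytes.
    printf("sharedSizeBytes = %zu\n", attr.sharedSizeBytes);
    return 0;
}
```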
Could someone explain why?