
I am having trouble understanding the occupancy calculator. I have some development code where 512 threads per block works fine, but 1024 threads per block gives bad numbers.

I am running a Tesla C2050 on Windows 7, developing in MATLAB (it's not my fault I have to use MATLAB) with a mexFunction.

I thought I would play around with the occupancy calculator to try to find any other restrictions on my code that were affecting the results.

When I enter 1024 threads per block, the calculator reports 0% occupancy. With 512 threads, the occupancy is 33%. I would have thought that I would get at least something with 1024 threads. I have noticed that both the code and the occupancy calculator give good results up to a maximum of 704 threads (a number that doesn't correspond to anything obvious).

I believe my lack of understanding in this area is the reason I cannot correct the error I am seeing in the code. Can anyone explain why I'm getting these results?

The numbers are:

  • compute capability 2.0
  • shared memory size 49152
  • threads per block 512 or 1024
  • registers per thread 44
  • shared memory per block 0

ptxas info : Used 44 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]


1 Answer


The total number of registers available per block is 32768 (you can check this with deviceQuery in the SDK). According to the ptxas output, your kernel uses 44 registers per thread. If you launch it with 1024 threads per block, the block needs 44 * 1024 = 45056 registers, which is above the limit, so the block cannot be scheduled at all. That is why the calculator shows 0% occupancy. In order to run with 1024 threads per block, you will need to optimize your kernel so that it uses no more than 32 registers per thread (32768 / 1024 = 32).
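The register arithmetic above can be sketched in a few lines of plain C. This is only a rough model: the function names are made up for illustration, the 32768 limit is the figure quoted in this answer for compute capability 2.0, and it ignores the hardware's register allocation granularity, so the real limit can be somewhat lower than this estimate (the question reports 704 threads as the practical maximum).

```c
#include <stdbool.h>

/* Assumption: 32768 registers available to a block, the figure quoted
 * above for compute capability 2.0 (deviceQuery reports it). */
enum { REGS_PER_BLOCK_LIMIT = 32768 };

/* Registers one block needs, ignoring allocation granularity. */
static int regs_needed(int regs_per_thread, int threads_per_block)
{
    return regs_per_thread * threads_per_block;
}

/* True if a block of this size fits within the register limit. */
static bool block_fits(int regs_per_thread, int threads_per_block)
{
    return regs_needed(regs_per_thread, threads_per_block)
           <= REGS_PER_BLOCK_LIMIT;
}

/* Largest per-thread register count that still lets the block fit. */
static int max_regs_per_thread(int threads_per_block)
{
    return REGS_PER_BLOCK_LIMIT / threads_per_block;
}
```

With the numbers from the question, `block_fits(44, 512)` is true (22528 registers) and `block_fits(44, 1024)` is false (45056 registers), while `max_regs_per_thread(1024)` gives 32. The same arithmetic is consistent with the 33% figure: two 512-thread blocks would also need 45056 registers, so only one block (16 warps) is resident out of the 48 warps a compute capability 2.0 multiprocessor can hold. If reworking the kernel is not practical, nvcc's `--maxrregcount=32` option or a `__launch_bounds__(1024)` qualifier can cap register use, though the compiler may then spill registers to local memory and slow the kernel down.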