I have a kernel that contains a #pragma unroll 80 directive.
I run it on an NVIDIA GTX 285 (compute capability 1.3)
with the launch configuration dim3 thread_block( 16, 16 )
and dim3 grid( 40, 30 ),
and it works fine.
When I run it on an NVIDIA GTX 580 (compute capability 2.0) with the same launch configuration, it also works fine.
But when I change the launch configuration on the GTX 580 to
dim3 thread_block( 32, 32 )
and dim3 grid( 20, 15 ),
which produces the same total number of threads as before, I get incorrect results.
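For reference, the two launches look like this (my_kernel is a placeholder name; my actual kernel and its arguments are omitted):

```cuda
// Configuration that works on both cards:
dim3 thread_block(16, 16);    // 256 threads per block
dim3 grid(40, 30);            // 1200 blocks -> 307200 threads total
my_kernel<<<grid, thread_block>>>(/* args */);

// Configuration that gives incorrect results on the GTX 580:
dim3 thread_block2(32, 32);   // 1024 threads per block
dim3 grid2(20, 15);           // 300 blocks -> 307200 threads total
my_kernel<<<grid2, thread_block2>>>(/* args */);
```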
If I remove the #pragma unroll 80,
or replace it with #pragma unroll 1,
the kernel works fine on the GTX 580; if I leave it in, the kernel crashes.
Does anyone know why this happens? Thank you in advance.
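The pragma sits on the kernel's main loop, roughly like this (the body shown here is a simplified placeholder, not my actual kernel):

```cuda
__global__ void my_kernel(float *out, const float *in, int n)
{
    // Flatten the 2D block/thread indices into one global index.
    int idx = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y)
            + threadIdx.y * blockDim.x + threadIdx.x;

    float acc = 0.0f;
    #pragma unroll 80            // the directive in question
    for (int i = 0; i < n; ++i)  // the loop the pragma applies to
        acc += in[idx] * (float)i;
    out[idx] = acc;
}
```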
EDIT: I checked for kernel errors on both devices and got "invalid argument". Searching for the causes of this error, I found that it occurs when the grid or block dimensions exceed their limits. But that should not be the case here, since I use 16x16 = 256 threads per block and 40x30 = 1200 blocks in total; as far as I know, these values are within the limits for compute capability 1.3. Could this have anything to do with the loop-unrolling issue I describe above?
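This is the standard error-checking pattern I use after the launch (assuming the same placeholder kernel and configuration as above):

```cuda
#include <cstdio>

my_kernel<<<grid, thread_block>>>(/* args */);

// Catches launch-configuration problems such as "invalid argument":
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

// Catches errors that occur while the kernel is running:
err = cudaDeviceSynchronize();   // cudaThreadSynchronize() on older toolkits
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));
```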