
I have a kernel with a #pragma unroll 80 and I'm running it on an NVIDIA GT 285, compute capability 1.3, with the launch configuration dim3 thread_block(16, 16) and dim3 grid(40, 30), and it works fine.

When I run it on an NVIDIA GT 580, compute capability 2.0, with the same launch configuration, it also works fine.

When I change the launch configuration on the GT 580 to

dim3 thread_block(32, 32) and dim3 grid(20, 15), which produces the same total number of threads as above, I get incorrect results.

If I remove the #pragma unroll 80 or replace it with #pragma unroll 1 on the GT 580, it works fine. If I don't, the kernel crashes.
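My kernel and launch have roughly the following structure (the kernel name, array names and loop body here are placeholders, not the actual code):

    __global__ void my_kernel(const unsigned char *in, float *out)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int idx = y * 640 + x;              // 640 x 480 threads in total
        float acc = 0.0f;

        #pragma unroll 80
        for (int i = 0; i < 80; ++i)
            acc += in[idx * 80 + i];        // per-iteration work, placeholder

        out[idx] = acc;
    }

    // Works on the GT 285 (and on the GT 580):
    dim3 thread_block(16, 16);
    dim3 grid(40, 30);
    my_kernel<<<grid, thread_block>>>(d_in, d_out);

    // Gives wrong results on the GT 580 with the unroll in place:
    // dim3 thread_block(32, 32);
    // dim3 grid(20, 15);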

Does anyone know why this happens? Thanks in advance.

EDIT: I checked for kernel errors on both devices and got "invalid argument". Searching for the causes of this error, I found that it occurs when the dimensions of the grid or the block exceed their limits. But that is not the case here, since I use 16x16 = 256 threads per block and 40x30 = 1200 total blocks; as far as I know, these values are within the limits for compute capability 1.3. I would like to know whether this could have anything to do with the loop-unrolling issue I have.
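This is roughly how I check for errors around the launch (the kernel name and arguments are placeholders):

    my_kernel<<<grid, thread_block>>>(d_in, d_out);

    // Launch-configuration problems (e.g. "invalid argument") show up here:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Errors that occur while the kernel is running show up here:
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));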

Did you check the CUDA error after the kernel execution? – w00d
I use error checks for my memcopies in my host code and I didn't get any error. – user1280671
Can you post the code where you launch the kernel and check the error? You should check the error after launching the kernel and after running a cudaDeviceSynchronize(). Since I didn't see any code checking the boundary cases when accessing dev_feature1, I suspect your code has memory corruption. – w00d
If it helps, dev_feature1 is a __device__ array of 2150400 unsigned chars whose values I copied from the host with cudaMemcpyToSymbol. What kind of error checking are you referring to? I have the impression you are not talking about memcopy error checking. – user1280671
If you want help with the invalid argument at kernel launch, you would need to show the kernel invocation line along with the definition and allocation of all the arguments and the launch configuration passed to the kernel. – Robert Crovella

1 Answer


I figured out what the problem was.

After some bug fixes I got the "Too Many Resources Requested for Launch" error. Loop unrolling needs extra registers per thread, and I was running out of registers, hence the error and the kernel failure. I needed 22 registers per thread, and I have 1024 threads per block.
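To confirm the per-thread register count, besides compiling with --ptxas-options=-v, it can also be queried from the host; a minimal sketch (the kernel name is a placeholder):

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread: %d\n", attr.numRegs);
    // With 1024 threads per block, the block needs attr.numRegs * 1024 registers.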

Entering my data into the CUDA Occupancy Calculator showed that 1 block per SM is scheduled, leaving 32768 registers for the whole block on the compute capability 2.0 device.

22 registers * 1024 threads = 22528 registers < 32768, which should have worked. But I was compiling with nvcc -arch sm_13, which applies the compute capability 1.3 limit of 16384 registers per SM.

I compiled with nvcc -arch sm_20, taking advantage of the 32768 registers, more than enough for the required 22528, and it works fine now. Thanks to everyone; I learned about kernel errors.
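For reference, the hardware's per-block register budget can also be read at runtime; a minimal sketch:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    // Reports 16384 on a compute capability 1.3 device and 32768 on 2.0.
    printf("registers per block: %d\n", prop.regsPerBlock);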