It is not really "stuck". It is just trapped in a hell of attempts to optimize the kernel - primarily by unrolling the fixed-size loops (and, by the way, by figuring out that the foo variable is not used at all!)
For example, when loops a...d are enabled (and e is switched off), the PTX that is generated for the kernel looks like this:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<13>;
mov.u32 %r10, 0;
BB0_1:
add.s32 %r10, %r10, 1;
mov.u32 %r11, 0;
BB0_2:
mov.u32 %r12, 10;
BB0_3:
add.s32 %r12, %r12, -2;
setp.ne.s32 %p1, %r12, 0;
@%p1 bra BB0_3;
add.s32 %r11, %r11, 1;
setp.ne.s32 %p2, %r11, 10;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r10, 10;
@%p3 bra BB0_1;
ret;
}
You can see that it is not really computing anything - and the compiler already has a hard time figuring out that there is actually nothing to do.
Compare this to the output that is generated when you add the line
notusedvariable[0]=foo;
as the last line of the kernel: now the computations can not be skipped and optimized away. After quite a while of compiling, it produces this result:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<80>;
mov.u32 %r79, 1;
mov.u32 %r73, 0;
mov.u32 %r72, %r73;
BB0_1:
add.s32 %r7, %r73, 1;
add.s32 %r72, %r72, 2;
mov.u32 %r76, 0;
mov.u32 %r74, %r76;
mov.u32 %r73, %r7;
mov.u32 %r75, %r7;
BB0_2:
mov.u32 %r9, %r75;
add.s32 %r74, %r74, %r72;
mov.u32 %r78, 10;
mov.u32 %r77, 0;
BB0_3:
add.s32 %r40, %r9, %r77;
mul.lo.s32 %r41, %r40, %r79;
mul.lo.s32 %r42, %r40, %r41;
add.s32 %r43, %r74, %r77;
mul.lo.s32 %r53, %r42, %r40;
mul.lo.s32 %r54, %r53, %r40;
mul.lo.s32 %r55, %r54, %r40;
mul.lo.s32 %r56, %r55, %r40;
mul.lo.s32 %r57, %r56, %r40;
mul.lo.s32 %r58, %r57, %r40;
mul.lo.s32 %r59, %r58, %r40;
mul.lo.s32 %r60, %r59, %r40;
mul.lo.s32 %r61, %r60, %r43;
mul.lo.s32 %r62, %r61, %r43;
mul.lo.s32 %r63, %r62, %r43;
mul.lo.s32 %r64, %r63, %r43;
mul.lo.s32 %r65, %r64, %r43;
mul.lo.s32 %r66, %r65, %r43;
mul.lo.s32 %r67, %r66, %r43;
mul.lo.s32 %r68, %r67, %r43;
mul.lo.s32 %r69, %r68, %r43;
mul.lo.s32 %r70, %r69, %r43;
mul.lo.s32 %r79, %r70, -180289536;
add.s32 %r77, %r77, %r74;
add.s32 %r78, %r78, -2;
setp.ne.s32 %p1, %r78, 0;
@%p1 bra BB0_3;
add.s32 %r76, %r76, 1;
add.s32 %r30, %r9, %r7;
setp.ne.s32 %p2, %r76, 10;
mov.u32 %r75, %r30;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r7, 10;
@%p3 bra BB0_1;
ld.param.u32 %r71, [foobar_param_0];
st.global.u32 [%r71], %r79;
ret;
}
Obviously, it has unrolled some of the loops, now that it could not optimize them away any more. I assume that when loop "e" is also activated, the time that is required for this sort of unrolling (or for optimizing away the unused loops) increases at least quadratically. So if you give it a few hours, it might actually finish the compilation as well....
As Tom Fenech already said in https://stackoverflow.com/a/22011454 , this problem can be alleviated by passing -cl-opt-disable to clBuildProgram.
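On the host side, the flag is simply passed as the options string. A minimal sketch (the program and device variables are hypothetical and assumed to have been created with the usual clCreateProgramWithSource / clGetDeviceIDs calls):

```
const char *options = "-cl-opt-disable";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
if (err != CL_SUCCESS) {
    /* On failure, the build log can be queried with
       clGetProgramBuildInfo(..., CL_PROGRAM_BUILD_LOG, ...) */
}
```

Note that this disables all optimizations, not only the unrolling, so the resulting kernel may run noticeably slower.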
Alternatively, you can selectively switch off the unrolling optimization for each loop: when you insert
#pragma unroll 1
directly before a for loop, you effectively disable unrolling for that particular loop.
Important: Don't blindly insert the unroll pragma with arbitrary values. Using 1 is safe, but for other values, you have to manually make sure that they do not affect the correctness of the program. See the CUDA programming guide, section "B.21. #pragma unroll".
In this case, it seems to be sufficient to insert #pragma unroll 1 before the two innermost loops (d and e) in order to disable enough of the unrolling to quickly build the program.
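Since the original kernel is not shown here, the following is only a hypothetical sketch of what that placement could look like (loop bounds and variable names are made up - only the position of the pragma matters):

```
for (int a = 0; a < 10; a++) {      // outer loops may still be unrolled
    // ...
    #pragma unroll 1                // keep loop d rolled
    for (int d = 0; d < 10; d++) {
        #pragma unroll 1            // keep loop e rolled
        for (int e = 0; e < 10; e++) {
            // ...
        }
    }
}
```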
EDIT: sigh prunge was 4 minutes faster... :-(