It is not really "stuck". It is just trapped in a hell of attempts to optimize the kernel - primarily by unrolling the fixed-size loops (and, by the way, by figuring out that the foo variable is not used at all!)
For example, when loops a...d are enabled (and e is switched off), the PTX that is generated for the kernel looks like this:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<13>;
mov.u32 %r10, 0;
BB0_1:
add.s32 %r10, %r10, 1;
mov.u32 %r11, 0;
BB0_2:
mov.u32 %r12, 10;
BB0_3:
add.s32 %r12, %r12, -2;
setp.ne.s32 %p1, %r12, 0;
@%p1 bra BB0_3;
add.s32 %r11, %r11, 1;
setp.ne.s32 %p2, %r11, 10;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r10, 10;
@%p3 bra BB0_1;
ret;
}
You can see that it is not really computing anything - and the compiler already has a hard time figuring out that there is actually nothing to do.
Compare this to the output that is generated when you add the line
notusedvariable[0]=foo;
as the last line of the kernel: now the computations can not be skipped and optimized away. After quite a while of compiling, it produces this result:
.entry foobar(
.param .u32 .ptr .global .align 4 foobar_param_0
)
{
.reg .pred %p<4>;
.reg .s32 %r<80>;
mov.u32 %r79, 1;
mov.u32 %r73, 0;
mov.u32 %r72, %r73;
BB0_1:
add.s32 %r7, %r73, 1;
add.s32 %r72, %r72, 2;
mov.u32 %r76, 0;
mov.u32 %r74, %r76;
mov.u32 %r73, %r7;
mov.u32 %r75, %r7;
BB0_2:
mov.u32 %r9, %r75;
add.s32 %r74, %r74, %r72;
mov.u32 %r78, 10;
mov.u32 %r77, 0;
BB0_3:
add.s32 %r40, %r9, %r77;
mul.lo.s32 %r41, %r40, %r79;
mul.lo.s32 %r42, %r40, %r41;
add.s32 %r43, %r74, %r77;
mul.lo.s32 %r53, %r42, %r40;
mul.lo.s32 %r54, %r53, %r40;
mul.lo.s32 %r55, %r54, %r40;
mul.lo.s32 %r56, %r55, %r40;
mul.lo.s32 %r57, %r56, %r40;
mul.lo.s32 %r58, %r57, %r40;
mul.lo.s32 %r59, %r58, %r40;
mul.lo.s32 %r60, %r59, %r40;
mul.lo.s32 %r61, %r60, %r43;
mul.lo.s32 %r62, %r61, %r43;
mul.lo.s32 %r63, %r62, %r43;
mul.lo.s32 %r64, %r63, %r43;
mul.lo.s32 %r65, %r64, %r43;
mul.lo.s32 %r66, %r65, %r43;
mul.lo.s32 %r67, %r66, %r43;
mul.lo.s32 %r68, %r67, %r43;
mul.lo.s32 %r69, %r68, %r43;
mul.lo.s32 %r70, %r69, %r43;
mul.lo.s32 %r79, %r70, -180289536;
add.s32 %r77, %r77, %r74;
add.s32 %r78, %r78, -2;
setp.ne.s32 %p1, %r78, 0;
@%p1 bra BB0_3;
add.s32 %r76, %r76, 1;
add.s32 %r30, %r9, %r7;
setp.ne.s32 %p2, %r76, 10;
mov.u32 %r75, %r30;
@%p2 bra BB0_2;
setp.ne.s32 %p3, %r7, 10;
@%p3 bra BB0_1;
ld.param.u32 %r71, [foobar_param_0];
st.global.u32 [%r71], %r79;
ret;
}
Obviously, it has unrolled some of the loops, now that it could not optimize them away any more. I assume that when loop "e" is also activated, the time that is required for this sort of unrolling (or for optimizing away the unused loops) increases at least quadratically. So if you give it a few hours, it might actually finish the compilation as well....
As Tom Fenech already said in https://stackoverflow.com/a/22011454 , this problem can be alleviated by passing -cl-opt-disable to clBuildProgram.
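On the host side, the flag is simply passed as the options string. A minimal sketch (the program and device variables are hypothetical and assumed to have been created with the usual clCreateProgramWithSource / clGetDeviceIDs calls):

```
const char *options = "-cl-opt-disable";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
if (err != CL_SUCCESS) {
    /* On failure, the build log can be queried with
       clGetProgramBuildInfo(..., CL_PROGRAM_BUILD_LOG, ...) */
}
```

Note that this disables all optimizations, not only the unrolling, so the resulting kernel may run noticeably slower.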
Alternatively, you can selectively switch off the unrolling optimization for each loop: when you insert
#pragma unroll 1
directly before a for loop, you effectively disable unrolling for that particular loop.
Important: Don't blindly insert the unroll pragma with arbitrary values. Using 1 is safe, but for other values, you have to manually make sure that they do not affect the correctness of the program. See the CUDA programming guide, section "B.21. #pragma unroll".
In this case, it seems to be sufficient to insert #pragma unroll 1 before the two innermost loops (d and e) in order to disable enough of the unrolling to quickly build the program.
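Since the original kernel is not shown here, the following is only a hypothetical sketch of what that placement could look like (loop bounds and variable names are made up - only the position of the pragma matters):

```
for (int a = 0; a < 10; a++) {      // outer loops may still be unrolled
    // ...
    #pragma unroll 1                // keep loop d rolled
    for (int d = 0; d < 10; d++) {
        #pragma unroll 1            // keep loop e rolled
        for (int e = 0; e < 10; e++) {
            // ...
        }
    }
}
```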
EDIT: sigh prunge was 4 minutes faster... :-(