
Hello everyone.

I have this kernel:

__kernel void FuncionCL(__global char* in, __global char* out, __global int* S2)
{
    __private int op1, op2, op3;
    __private int C;
    __private uint WorkDim, C2;
    op1 = 1;
    op2 = 2;
    WorkDim = get_global_size(0);
    __private int ID;
    ID = get_global_id(0);
    for(C = 0; C < 1000000; C++)
    {
        for(C2 = ID; C2 < 1000; C2 += WorkDim)
        {
            op3 = op1 + op2;
        }
    }
    out[0] = 90;
    out[1] = 89;
    *S2 = (int) WorkDim;
}

It crashes not only the application but the graphics driver too. If I change the for-loop increment to the constant value '16' (the value that get_global_size() returns), the code runs fine. What's the problem?

If I run the code with:

WorkDim = 16;

on line 8 of the kernel instead of:

WorkDim = get_global_size(0);

the code runs 400 times faster. That's the problem: why, if the value is the same?

**EDIT:** Well, now I know why the code is so slow. There are multiple reasons:

1.- The occupancy.

2.- All the threads do the same iterations in the first loop. The right code looks like this:

__kernel void FuncionCL(__global char* in, __global char* out, __global int* S2)
{
    __private int op1, op2, op3;
    __private int C;
    __private uint WorkDim, C2;
    op1 = 1;
    op2 = 2;
    WorkDim = get_global_size(0);
    __private int ID;
    ID = get_global_id(0);
    for(C = ID; C < 1000000; C += WorkDim)
    {
        for(C2 = ID; C2 < 1000; C2 += WorkDim)
        {
            op3 = op1 + op2;
        }
    }
    out[0] = 90;
    out[1] = 89;
    *S2 = (int) WorkDim;
}

Now my code runs 6.1 times faster on the GPU than on the CPU.
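To see why striding the outer loop by the global size splits the work instead of repeating it, here is a minimal host-side sketch in plain C (not OpenCL). It simulates the loop `for(C = ID; C < n; C += WorkDim)` for every work-item in turn and counts how often each index is visited. The function names `simulate_strided_loop` and `covered_exactly_once` are made up for this illustration and are not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side simulation of the strided outer loop
 *     for (C = ID; C < n; C += WorkDim)
 * visits[i] receives the number of simulated work-items that processed index i.
 * Hypothetical helper, for illustration only. */
void simulate_strided_loop(size_t n, size_t global_size, unsigned *visits)
{
    assert(global_size > 0);  /* a zero stride would never terminate */

    for (size_t i = 0; i < n; i++)
        visits[i] = 0;

    for (size_t id = 0; id < global_size; id++)       /* one pass per work-item */
        for (size_t c = id; c < n; c += global_size)  /* same stride as the kernel */
            visits[c]++;
}

/* Returns 1 when every index in [0, n) was processed exactly once, else 0. */
int covered_exactly_once(const unsigned *visits, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (visits[i] != 1)
            return 0;
    return 1;
}
```

With a global size of 16, each index of the outer range ends up handled by exactly one work-item, whereas in the first version of the kernel every work-item walked the whole outer range by itself.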

Does it crash instantly or after some time (like 5 to 10 seconds)? If the latter, it is the OS GPU watchdog timer resetting the GPU because the kernel is taking too long. Ideally you want your kernels to take less than 50 ms, or they start affecting system responsiveness (since the GPU is also used for UI drawing, etc.). - Dithermaster
Thanks for your answer. Yes, it crashes after 5 seconds. But the question is why? It looks like get_global_size(0) returns 0 and the loop never ends, but if I use a constant increment instead of the value in WorkDim, the kernel returns '16' (the expected value of get_global_size(0)). - El Benjo

1 Answer


Each work-item there is doing 1000000*1000 = 1 Gop. That is just too much: it takes too long, and the driver restarts the GPU. (I am guessing the global size is 1 in your example.)
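The figure above depends on the global size. As a rough sketch in plain C (not OpenCL), the per-work-item iteration count of the original kernel can be worked out from its loop bounds: the outer loop always runs 1000000 times, and the inner loop `for(C2 = ID; C2 < 1000; C2 += WorkDim)` runs about 1000 / WorkDim times. The helper names `inner_trip_count` and `iterations_per_item` are hypothetical, chosen only for this example.

```c
#include <assert.h>

/* Trip count of: for (C2 = id; C2 < limit; C2 += stride)
 * Hypothetical helper, for illustration only. */
unsigned long long inner_trip_count(unsigned long long id,
                                    unsigned long long stride,
                                    unsigned long long limit)
{
    assert(stride > 0);  /* a zero stride would loop forever */
    if (id >= limit)
        return 0;
    return (limit - id + stride - 1) / stride;  /* ceil((limit - id) / stride) */
}

/* Inner-loop bodies executed by one work-item of the original kernel:
 * 1000000 outer iterations, each running the strided inner loop up to 1000. */
unsigned long long iterations_per_item(unsigned long long id,
                                       unsigned long long global_size)
{
    return 1000000ULL * inner_trip_count(id, global_size, 1000ULL);
}
```

Under these assumptions, `iterations_per_item(0, 1)` gives 1000000000 (the 1 Gop case guessed above), while with the global size of 16 reported in the question, work-item 0 executes 63000000 inner iterations. That is still a very long single kernel launch for such a small number of work-items.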

It is a total waste of resources to run a CL kernel with so few work-items; it makes the GPU do almost-serial computation, which takes too long.

At least 1024 global work-items are needed on new GPUs to fully use their resources.

EDIT: The loop is probably optimized away by the compiler when the increment is a static value, hence the "amazing" speedup.