Optimizing kernel code in opencl for a GPU

Question

As of now, my GPU is slower than my CPU when it comes to kernel execution time. I thought maybe since I was testing with a small sample, the CPU ended up finishing faster because of a smaller startup overhead. However, when I tested the kernel with data almost 10 times the size of the sample, the CPU was still finishing faster and the GPU was almost 400ms behind.

Runtime with 2.39MB file CPU: 43.511ms GPU: 65.219ms

Runtime with 32.9MB file CPU: 289.541ms GPU: 605.400ms

I tried using local memory, although I'm 100% sure I was using it wrong, and ran into two issues. The kernel finishes anywhere between 1000-3000ms (depending on what size I set for localWorkSize) or I run into a status code of -5, which is CL_OUT_OF_RESOURCES.

Here is the kernel that a fellow SO member helped me out with.

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {

int globalId = get_global_id(0); 
float sum=0.0f;
for (int i=0; i< 65; i++)
{
    float tmp=0;
    if (globalId+i > 63)
    {
        tmp=Array[i+globalId-64]*coefficients[64-i];    

    }

    sum += tmp;

}
Output[globalId]=sum;
}

This was my attempt at using local memory. First bit will be a snippet from the host code and the following portion is the kernel.

//Set the size of localMem
status |= clSetKernelArg(
    kernel,
    2,
    1024, //I had num_items*(float) but it gave me a -5. Num items is the amount of elements in my array (around 1.2 million elements)
    null);
printf("Kernel Arg output status: %i \n", status);

//set a localWorkSize
localWorkSize[0] = 64;

//execute the kernel with localWorkSize included
status = clEnqueueNDRangeKernel(
    cmdQueue,
    kernel,
    1,
    NULL,
    globalWorkSize,
    localWorkSize,
    0,
    NULL,
    &someEvent);


 //Here is what I did to the kernel*************************************** 
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {

int globalId = get_global_id(0);
int localId = get_local_id(0);  

localMem[localId] = globalId[globalId];

float sum=0.0f;
for (int i=0; i< 65; i++)
{
    float tmp=0;
    if (globalId+i > 63)
    {
        tmp=localMem[i+localId-64]*coefficients[64-i];  

    }

    sum += tmp;

}
Output[globalId]=sum;
}

Reference link I used when trying to set local variables: How do I use local memory in OpenCL?

Link used to find kernelWorkGroupSize (this is why I have 1024 set in the kernelArg): CL_OUT_OF_RESOURCES for 2 millions floats with 1GB VRAM?

I've seen other people have similar problems where the GPU is slower than the CPU but for many of them, they are using clEnqueueKernel instead of clEnqueueNDRangeKernel.

Heres my previous question if you need more info on this kernel: Best approach to FIFO implementation in a kernel OpenCL

Found some optimization tricks for GPU's aswell. https://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf

Edited code; Error still exists

__kernel void lowpass2(__global float *Array, __global float *coefficients, __global float *Output) {

int globalId = get_global_id(0); 
float sum=0.0f;
float tmp=0.0f;
for (int i=64-globalId; i< 65; i++)
{

tmp = 0.0f;
tmp=Array[i]*coefficients[i];    
sum += tmp;

}
Output[globalId]=sum;
}

I'm pretty sure you really don't want an if()-statement in your inner for-loop. A smart compiler may be able to hoist the if out of the loop, but a gpu-driver probably doesn't have the time or the smarts to do this efficiently. — EOF
@EOF I am going to take a look into switch statements as an alternative to if(). — VedhaR
@mfa I am trying to implement a lowpass fir filter. The filter works great, it's just a matter of reducing the kernel execution time now. — VedhaR
@VedhaR: Why on earth would you need a switch-statement? That's probably even worse. Let's take a simple look at the conditional, shall we? if (globalId+i > 63) is quite obviously equivalent to if (globalId+i >= 64), which is obviously equivalent to if (i >= 64 - globalId), which is easily eliminated by changing the for-loops initialization statement: for (int i=64-globalId; i< 65; i++). — EOF

huseyin tugrul buyukisik huseyin tugrul buyukisik · Accepted Answer · 2016-05-31T11:17:55

Running the following kernel for 24 million element arrays

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {

int globalId = get_global_id(0); 
float sum=0.0f;
for (int i=0; i< 65; i++)
{
    float tmp=0;
    if (globalId+i > 63)
    {
        tmp=Array[i+globalId-64]*coefficients[64-i];    

    }

    sum += tmp;

}
Output[globalId]=sum;
}

is completed under 200 ms for a 25 compute unit device pool but over 500 ms for a 8 core cpu.

Either you have a high-end cpu and a low-end gpu or the gpu driver has been gimped or gpu's pci-e interface is stuck at pci-e 1.1 @ 4x bandwidth so array copies between host and device is limited.

On the other hand, this optimized version:

__kernel void lowpass(__global __read_only float *Array,__constant  float *coefficients, __global __write_only float *Output) {

        int globalId = get_global_id(0); 
        float sum=0.0f;
        int min_i= max(64,globalId)-64;
        int max_i= min_i+65;
        for (int i=min_i; i< max_i; i++)
        {
            sum +=Array[i]*coefficients[globalId-i];    
        }
        Output[globalId]=sum;
}

has under 150 ms for cpu(8 compute unit) and under 80ms for gpu(25 compute unit) compute times. Work per item is only 65 times. This low number of operations could be very easily accelerated using __constant and __read_only and __write_only parameter specifiers and some integer work reducing.

Using float4 instead of float type for Array and Output should increase speed by %80 for both your cpu and gpu since those are SIMD type and vector compute units.

Bottlenecks of this kernel are:

Only 65 multiplications and 65 summations per thread.
But still the data travels over pci-express interface, slow.
Also 1 conditional check( i < max_i) per float operation is high, needs loop unrolling.
Everything being scalar although your cpu and gpu are vector based.

Generally:

Running kernel for the first time triggers just in time compiler optimization of opencl, slow. Run at least 5-10 times for exact timings.
__constant space is only 10 - 100 kB but its faster than __global and is good for amd's hd5000 series.
Kernel overhead is 100 microseconds while 65 cache operations are less than that and is shadowed by kernel overhead time(and even worse, by pci-e latency).
Too few work items makes occupation ratio less, slow.

Also:

4-core Xeon @ 3 GHz is much faster than 16(1/4 of vliw5)*2(compute units)=32 cores of gpu @600 MHz because of branch prediction, total cache bandwidth, instruction latency and no-pcie latency.
HD5000 series amd cards are legacy, same as gimped.
HD5450 has 166 GB/s constant-memory bandwidth
Which also has only 83 GB/s LDS(local memory) bandwidth
Which also has 83 GB/s L1 and L2 cache bandwidths so just let it work on __global driver optimizations instead of LDS unless you plan on upgrading your computer.(for Array ofcourse) Maybe, odd elements from LDS, even elements from __global could have 83+83 = 166 GB/s bandwidth. You can try. Maybe two by two is better than alternating in terms of bank conflicts.
Using coefficients as __constant (166 GB/s) and Array as __global should give you 166 + 83 = 249 GB/s combined bandwidth.
Each coefficient element is used for only once per thread so I'm not suggesting to use private registers (499 GB/s)

Optimizing kernel code in opencl for a GPU

2 Answers