As of now, my GPU is slower than my CPU when it comes to kernel execution time. I thought maybe since I was testing with a small sample, the CPU ended up finishing faster because of a smaller startup overhead. However, when I tested the kernel with data almost 10 times the size of the sample, the CPU was still finishing faster and the GPU was almost 400ms behind.
Runtime with 2.39MB file CPU: 43.511ms GPU: 65.219ms
Runtime with 32.9MB file CPU: 289.541ms GPU: 605.400ms
I tried using local memory, although I'm 100% sure I was using it wrong, and ran into two issues. The kernel finishes anywhere between 1000-3000ms (depending on what size I set for localWorkSize) or I run into a status code of -5, which is CL_OUT_OF_RESOURCES.
Here is the kernel that a fellow SO member helped me out with.
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
int globalId = get_global_id(0);
float sum=0.0f;
for (int i=0; i< 65; i++)
{
float tmp=0;
if (globalId+i > 63)
{
tmp=Array[i+globalId-64]*coefficients[64-i];
}
sum += tmp;
}
Output[globalId]=sum;
}
This was my attempt at using local memory. First bit will be a snippet from the host code and the following portion is the kernel.
//Set the size of localMem
status |= clSetKernelArg(
kernel,
2,
1024, //I had num_items*(float) but it gave me a -5. Num items is the amount of elements in my array (around 1.2 million elements)
null);
printf("Kernel Arg output status: %i \n", status);
//set a localWorkSize
localWorkSize[0] = 64;
//execute the kernel with localWorkSize included
status = clEnqueueNDRangeKernel(
cmdQueue,
kernel,
1,
NULL,
globalWorkSize,
localWorkSize,
0,
NULL,
&someEvent);
//Here is what I did to the kernel***************************************
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {
int globalId = get_global_id(0);
int localId = get_local_id(0);
localMem[localId] = globalId[globalId];
float sum=0.0f;
for (int i=0; i< 65; i++)
{
float tmp=0;
if (globalId+i > 63)
{
tmp=localMem[i+localId-64]*coefficients[64-i];
}
sum += tmp;
}
Output[globalId]=sum;
}
Reference link I used when trying to set local variables: How do I use local memory in OpenCL?
Link used to find kernelWorkGroupSize (this is why I have 1024 set in the kernelArg): CL_OUT_OF_RESOURCES for 2 millions floats with 1GB VRAM?
I've seen other people have similar problems where the GPU is slower than the CPU but for many of them, they are using clEnqueueKernel instead of clEnqueueNDRangeKernel.
Heres my previous question if you need more info on this kernel: Best approach to FIFO implementation in a kernel OpenCL
Found some optimization tricks for GPU's aswell. https://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf
Edited code; Error still exists
__kernel void lowpass2(__global float *Array, __global float *coefficients, __global float *Output) {
int globalId = get_global_id(0);
float sum=0.0f;
float tmp=0.0f;
for (int i=64-globalId; i< 65; i++)
{
tmp = 0.0f;
tmp=Array[i]*coefficients[i];
sum += tmp;
}
Output[globalId]=sum;
}
if()
-statement in your innerfor
-loop. A smart compiler may be able to hoist theif
out of the loop, but a gpu-driver probably doesn't have the time or the smarts to do this efficiently. – EOFif (globalId+i > 63)
is quite obviously equivalent toif (globalId+i >= 64)
, which is obviously equivalent toif (i >= 64 - globalId)
, which is easily eliminated by changing the for-loops initialization statement:for (int i=64-globalId; i< 65; i++)
. – EOF