
I have just started using OpenCL and am trying to run the Sobel edge detection example from http://www.karlosp.net/blog/2012/05/03/opencl-opencv-sobel-edge-detector/. When I run the kernel on the GPU, the frame rate is below 15 fps and GPU utilization is below 5%. How can I keep all GPU threads busy, the way OpenMP does for CPU threads, so that utilization reaches at least 95%?

The kernel code:

__kernel void sobel(__global float *A, __global float *R, uint width, uint height) {

int globalIdx = get_global_id(0);
int globalIdy = get_global_id(1);
int index = width * globalIdy + globalIdx;
float a,b,c,d,e,f,g,h,i;
float sobelX = 0;
float sobelY = 0;
// Skip the one-pixel border: only interior pixels have a full 3x3 neighbourhood.
if(globalIdx > 0 && globalIdx < (int)width-1 && globalIdy > 0 && globalIdy < (int)height-1){
    a = A[index-1-width] * -1.0f;
    b = A[index-0-width] *  0.0f;
    c = A[index+1-width] * +1.0f;
    d = A[index-1] * -2.0f;
    e = A[index-0] *  0.0f;
    f = A[index+1] * +2.0f;
    g = A[index-1+width] * -1.0f;
    h = A[index-0+width] *  0.0f;
    i = A[index+1+width] * +1.0f;
    sobelX = a+b+c+d+e+f+g+h+i;

    a = A[index-1-width] * -1.0f;
    b = A[index-0-width] * -2.0f;
    c = A[index+1-width] * -1.0f;
    d = A[index-1] * 0.0f;
    e = A[index-0] * 0.0f;
    f = A[index+1] * 0.0f;
    g = A[index-1+width] * +1.0f;
    h = A[index-0+width] * +2.0f;
    i = A[index+1+width] * +1.0f;
    sobelY = a+b+c+d+e+f+g+h+i;
}


R[index] = sqrt(pow(sobelX,2) + pow(sobelY,2));

}

Are you calling clEnqueueWriteBuffer/clEnqueueReadBuffer on every iteration? That would hurt performance. If you can keep the whole computation on the GPU, it will be faster, because edge detection is not computationally heavy; it is mostly memory access. You can optimize the memory access with a shared (local) memory optimization per work-group if optimizing the PCIe transfers is not enough. In this example, youtube.com/watch?v=602XVhl2QMY, the result is many times faster at 256x256 texture resolution once PCIe access is avoided, even with particle interactions and deformable mesh calculations. The technique is called CL-GL interop. Please post some code. - huseyin tugrul buyukisik
Thank you for answering. Can you help me with some code? I want the GPU to run at 100%, like OpenMP does for CPU threads. My actual program reads a frame from a camera or video, does some processing on the CPU, then sends the frame to the GPU for more processing (I want Canny edge detection and a Hough transform), and then returns to the CPU to finish. The Sobel detection runs at 15 fps at a resolution of 640*480. I use clEnqueueWriteBuffer/clEnqueueReadBuffer to write to and read from the GPU after each frame. Is real-time processing possible this way? - user1848223
PCIe bandwidth is usually 4-5 GB per second, which is about 1 billion 32-bit floats per second. 640*480 = 307200 pixels, and multiplying by 4 bytes per pixel (RGBA) gives 1228800 bytes per frame, which is about 1/4000 of what PCIe can move in one second. At 15 FPS you are sending roughly 18 MB per second, which is very little. BUT the GPU sits idle while the data is being sent, and idling for even a millisecond lowers the reported usage percentage. If you don't want interop, you can do the work asynchronously: first send 50 to-be-processed frames to the GPU and start the kernel, then send and receive new data in the meantime while keeping the GPU busy. - huseyin tugrul buyukisik
That way you keep the GPU busy with the frames pre-loaded into GPU memory. I think the low usage is simply the reality of having a high-end GPU :) - huseyin tugrul buyukisik
You cannot get 100%, even if you optimize heavily. 5% does not mean the GPU is running at 5% of its speed; it means the average usage over time is 5%. If your code runs fast and I/O is the bottleneck (as well as the CPU queueing data), you will barely go over that level. For example, even very heavy computational tasks usually reach only 50% if the CPU is also a bottleneck. - DarkZeros
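
The arithmetic in the comments above can be checked with a short C sketch. The frame size, frame rate and the 4 GB/s figure come from the thread; the 1 ms kernel time is purely an assumption for illustration (nobody in the thread measured it), and the function names are hypothetical.

```c
#include <stdio.h>

/* Bytes needed for one frame: width * height * bytes per pixel. */
double frame_bytes(int width, int height, int bytes_per_pixel) {
    return (double)width * height * bytes_per_pixel;
}

/* One-way transfer rate in bytes per second at a given frame rate. */
double transfer_rate(double bytes_per_frame, double fps) {
    return bytes_per_frame * fps;
}

/* Average utilization: fraction of each second the GPU is busy,
   given how long it works on each frame. */
double utilization(double busy_seconds_per_frame, double fps) {
    return busy_seconds_per_frame * fps;
}

/* Print the estimates for the 640x480, 15 fps case from the question. */
void print_estimates(void) {
    double bytes = frame_bytes(640, 480, 4);   /* 4 bytes per pixel */
    double rate  = transfer_rate(bytes, 15.0); /* one direction only */
    double pcie  = 4.0e9;    /* low end of the 4-5 GB/s quoted above */
    double kernel_s = 0.001; /* ASSUMPTION: 1 ms per frame, not measured */

    printf("bytes per frame : %.0f\n", bytes);
    printf("transfer rate   : %.1f MB/s\n", rate / 1.0e6);
    printf("share of PCIe   : %.3f %%\n", 100.0 * rate / pcie);
    printf("GPU utilization : %.1f %%\n", 100.0 * utilization(kernel_s, 15.0));
}
```

Under that assumed kernel time, the GPU is busy for only a small fraction of each 1/15 s frame period, which is consistent with DarkZeros's point that a low percentage reflects average usage rather than a slow kernel.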

1 Answer


The code you link to has a few inefficiencies (in no particular order):

  • The calls to clFinish after every OpenCL call inside the loop are unnecessary. clEnqueueWriteBuffer and clEnqueueReadBuffer are both called with the blocking flag set to CL_TRUE, i.e. they do not return until the write/read has finished (the definition of a blocking function call). With an in-order command queue (the default), the blocking read also cannot complete before the kernel enqueued ahead of it has finished.

  • You convert your unsigned char image into a float image before sending it to the GPU. This is not necessary, as the GPU is capable of working with unsigned char and can typecast to float if needed. Converting to float on the CPU and then sending to the GPU means sending 4x the amount of data (4 bytes per pixel per channel vs. 1 byte).

  • You call cvWaitKey(10), which waits up to 10 milliseconds for a keypress, so this code can never exceed 100 frames per second (a minor issue).

  • The RGB to gray-scale conversion can be done on the GPU, at the expense of sending 3 unsigned chars per pixel instead of 1, so there is a trade-off there that needs testing.
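
To make the data-size trade-off in the list above concrete, here is a minimal plain-C sketch. It assumes an interleaved 8-bit BGR frame (the usual OpenCV channel order) and the common 0.299 R + 0.587 G + 0.114 B luma weights in integer form; the function names are hypothetical and not taken from the linked code. In an OpenCL kernel, the body of the loop would run once per work-item on a `__global const uchar *` buffer instead.

```c
#include <stddef.h>

/* Bytes uploaded per frame for a given pixel layout. */
size_t upload_bytes(size_t width, size_t height,
                    size_t channels, size_t bytes_per_sample) {
    return width * height * channels * bytes_per_sample;
}

/* Gray value from 8-bit B, G, R samples, using integer luma weights
   (299, 587, 114 out of 1000) with rounding to nearest. */
unsigned char bgr_to_gray(unsigned char b, unsigned char g, unsigned char r) {
    return (unsigned char)((299u * r + 587u * g + 114u * b + 500u) / 1000u);
}

/* Convert an interleaved 8-bit BGR frame to 8-bit gray.
   On the GPU, each iteration would be one work-item. */
void bgr_frame_to_gray(const unsigned char *bgr, unsigned char *gray,
                       size_t pixels) {
    for (size_t i = 0; i < pixels; ++i) {
        gray[i] = bgr_to_gray(bgr[3 * i], bgr[3 * i + 1], bgr[3 * i + 2]);
    }
}
```

For a 640x480 frame, `upload_bytes(640, 480, 1, 4)` (gray as float) is 1228800 bytes, `upload_bytes(640, 480, 3, 1)` (raw BGR, converted on the GPU) is 921600 bytes, and `upload_bytes(640, 480, 1, 1)` (gray as unsigned char, converted on the CPU) is 307200 bytes. Which option wins overall depends on how the CPU conversion cost compares with the extra transfer, which is why it needs testing.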

Your timing approach is also flawed. Your code measures the time taken to load, process and display a single frame. Processing consists of both OpenCL and OpenCV components. You should benchmark each of these separately to determine how long each one takes so that you can determine exactly where your bottleneck lies.
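
One way to benchmark the stages separately is to wrap each one in a small timer and accumulate per-stage totals. The sketch below is a hedged illustration: `StageTimes`, `time_stage` and `count_call` are hypothetical names, and the stage callbacks stand in for your own capture, upload, kernel, download and display code. Note that a kernel enqueue returns immediately, so a host-side timer around the kernel stage only means something if that stage waits for completion (for example with clFinish or a blocking read); OpenCL event profiling via clGetEventProfilingInfo is an alternative for the device-side part.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#endif
#include <time.h>

/* Monotonic wall-clock time in seconds. */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Accumulated time per pipeline stage, in seconds. */
typedef struct {
    double capture, upload, kernel, download, display;
    long frames;
} StageTimes;

/* Run one stage, add its elapsed time to *total, and return it. */
double time_stage(void (*stage)(void *), void *ctx, double *total) {
    double t0 = now_seconds();
    stage(ctx);
    double dt = now_seconds() - t0;
    *total += dt;
    return dt;
}

/* Stand-in stage for illustration only: counts how often it ran. */
void count_call(void *ctx) {
    int *calls = (int *)ctx;
    ++*calls;
}
```

In the frame loop you would call, for example, `time_stage(run_kernel_and_wait, &state, &times.kernel)` for each stage (with `run_kernel_and_wait` being your own function), increment `times.frames`, and divide each total by `times.frames` at the end to see which stage dominates.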

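It also just occurred to me to ask: what is the frame rate of the camera that OpenCV is capturing from? If the camera itself delivers only 15 frames per second, the loop cannot run faster than that, no matter how fast the GPU is.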