4
votes

All, I wrote a very simple OpenCL kernel which transforms an RGB image to gray scale using simple averaging.

Some background:

  1. The image is stored in mapped memory, as a 24 bit, non padded memory block
  2. The output array is stored in pinned memory (mapped with clEnqueueMapBuffer) and is 8 bpp
  3. There are two buffers allocated on the device (clCreateBuffer), one is specifically read (which we clWriteBuffer into before the kernel starts) and the other is specifically write (which we clReadBuffer after the kernel finishes)

I am running this on a 1280x960 image. A serial version of the algorithm averages 60ms, the OpenCL kernel averages 200ms!!! I'm doing something wrong but I have no idea how to proceed, what to optimize. (Timing my reads/writes without a kernel call, the algorithm runs in 15ms)

I am attaching the kernel setup (sizes and arguments) as well as the kernel


EDIT: So I wrote an even dumber kernel, that does no global memory accesses inside it, and it was only 150ms... This is still ridiculously slow. I thought maybe I'm messing up with global memory reads, they have to be 4 byte aligned or something? Nope...

Edit 2: Removing the all parameters from my kernel gave me significant speed up... I'm confused I thought that since I'm clEnqueueWriteBuffer the kernel should be doing no memory transfer from host->device and device->host....

Edit 3: Figured it out, but I still don't understand why. If anyone could explain it I would be glad to award correct answer to them. The problem was passing the custom structs by value. It looks like I'll need to allocate a global memory location for them and pass their cl_mems


Kernel Call:

//Copy input to device
result = clEnqueueWriteBuffer(handles->queue, d_input_data, CL_TRUE, 0, h_input.widthStep*h_input.height, (void *)input->imageData, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to write to input buffer on device!")) return 0;

//Set kernel arguments
result = clSetKernelArg(handles->current_kernel, 0, sizeof(OpenCLImage), (void *)&h_input);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set input struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 1, sizeof(cl_mem), (void *)&d_input_data);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set input data.")) return 0;
result = clSetKernelArg(handles->current_kernel, 2, sizeof(OpenCLImage), (void *)&h_output);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set output struct.")) return 0;
result = clSetKernelArg(handles->current_kernel, 3, sizeof(cl_mem), (void *)&d_output_data);
if(check_result(result, "opencl_rgb_to_gray", "Failed to set output data.")) return 0;

//Determine run parameters
global_work_size[0] = input->width;//(unsigned int)((input->width / (float)local_work_size[0]) + 0.5);
global_work_size[1] = input->height;//(unsigned int)((input->height/ (float)local_work_size[1]) + 0.5);

printf("Global Work Group Size: %d %d\n", global_work_size[0], global_work_size[1]);

//Call kernel
result = clEnqueueNDRangeKernel(handles->queue, handles->current_kernel, 2, 0, global_work_size, local_work_size, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to run kernel!")) return 0;

result = clFinish(handles->queue);
if(check_result(result, "opencl_rgb_to_gray", "Failed to finish!")) return 0;

//Copy output
result = clEnqueueReadBuffer(handles->queue, d_output_data, CL_TRUE, 0, h_output.widthStep*h_output.height, (void *)output->imageData, 0, 0, 0);
if(check_result(result, "opencl_rgb_to_gray", "Failed to write to output buffer on device!")) return 0;

Kernel:

typedef struct OpenCLImage_t
{
    int width;
    int widthStep;
    int height;
    int channels;
} OpenCLImage;

__kernel void opencl_rgb_kernel(OpenCLImage input, __global unsigned char*  input_data, OpenCLImage output, __global unsigned char * output_data)
{
    int pixel_x = get_global_id(0);
    int pixel_y = get_global_id(1);
    unsigned char * cur_in_pixel, *cur_out_pixel;
    float avg = 0;

    cur_in_pixel = (unsigned char *)(input_data + pixel_y*input.widthStep + pixel_x * input.channels);
    cur_out_pixel = (unsigned char *)(output_data + pixel_y*output.widthStep + pixel_x * output.channels);

    avg += cur_in_pixel[0];
    avg += cur_in_pixel[1];
    avg+= cur_in_pixel[2];
    avg /=3.0f;

    if(avg > 255.0)
        avg = 255.0;
    else if(avg < 0)
        avg = 0;

    *cur_out_pixel = avg;
}
2
You haven't mentioned what kind of device is running the CL kernel, but odds are it doesn't have 1280*960=1.2 million compute cores. Try writing a kernel that operates on more than one pixel at a time, rather than burden the scheduler with thousands of threads per core. - user57368
This was not the issue, but I did switch to a larger local size by using clGetWorkGroupInfo to determine my local work group sizes. - Constantin
@user57368 Actually, you want to burden the scheduler with thousands of pending threads, that's how OpenCL works. On a GPU, if you don't do that, you'll get abysmal performance. On a CPU, it is less important, but you still want to, otherwise you'll be spending all your time issuing kernel calls. And, yes, the OpenCL implementation is not dumb and won't spawn one OS thread per kernel instance, it is smarter than that. - Thomas
A thousand threads for a thousand cores is how OpenCL should be used on a GPU. That's not the same as a thousand threads per core, which is what's posted above. - user57368
@user57368 GPU's don't have "a thousand cores", far from that. In general, you want to enqueue as many work items as you need to complete your task, for instance, if you want to grayscale a 2048x2048 image, you launch 2048x2048 work items, and each of them processes 1 pixel (in the simplest case). The OpenCL implementation does not launch one thread per work item. On the CPU, for instance, it launches as many threads as you have execution units, and iterates the kernel with a different globalID/localID, distributing the work over those threads. For the GPU, things are essentially the same. - Thomas

2 Answers

5
votes

Overhead of copying the value to all the threads that will be created might be the possible reason for the time; where as for a global memory the reference will be enough in the other case. The only the SDK implementer will be able to answer exactly.. :)

0
votes

You may want to try a local_work_size like [64, 1, 1] in order to coalesce your memory calls. (note that 64 is a diviser of 1280).

As previously said, you have to use a profiler in order to get more informations. Are you using an nvidia card ? Then download CUDA 4 (not 5), as it contains an openCL profiler.

Your performance must be far from the optimum. Change the local work size, the global work size, try to treat two or four pixels per tread. Can you change the way pixels are stored privous to your treatment? Then break your struct for tree arrays in order to coalesce memomry access more effectively.

Tou can hide your memory transfers with the GPU work: it will be more easy to do with a profiler near you.