I have the following OpenCL Kernel, a Gaussian Blur
__constant sampler_t sampler =
CLK_NORMALIZED_COORDS_FALSE |
CLK_ADDRESS_CLAMP_TO_EDGE |
CLK_FILTER_NEAREST;
__constant float gaussian_kernel[3][3] = {
{0.0625f, 0.125f, 0.0625f},
{0.125f, 0.25f, 0.125f},
{0.0625f, 0.125f, 0.0625f} };
void kernel gaussian_blur(
read_only image2d_t input_image,
write_only image2d_t output_image) {
int x = get_global_id(0);
int y = get_global_id(1);
int2 coords[9] = {
{ x - 1, y - 1 }, { x, y - 1 }, { x + 1, y - 1 },
{ x - 1, y }, { x, y }, { x + 1, y },
{ x - 1, y + 1 }, { x, y + 1 }, { x + 1, y + 1 }
};
float4 pixel_value = { 0.f, 0.f, 0.f, 0.f };
for(int i = 0; i < 3; ++i) {
for(int j = 0; j < 3; ++j) {
int index = i * 3 + j;
float4 blurred =
as_float4(read_imageui(input_image, sampler, coords[index]));
pixel_value.x += (blurred.x * gaussian_kernel[i][j]);
pixel_value.y += (blurred.y * gaussian_kernel[i][j]);
pixel_value.z += (blurred.z * gaussian_kernel[i][j]);
pixel_value.w += (blurred.w * gaussian_kernel[i][j]);
}
}
uint4 final_value = as_uint4(pixel_value);
write_imageui(output_image, coords[4], final_value);
}
When I specify the device to use as the CPU, The blur works properly. Here's the device selection code
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if(all_platforms.size() == 0) {
std::cerr << "No platforms available" <<std::endl;
exit(-1);
}
cl::Platform default_platform = all_platforms[0];
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if(all_devices.size() == 0) {
std::cerr << "No device found" << std::endl;
exit(-1);
}
cl::Device default_device = all_devices[1]; //Changing this index to 0 uses my graphics card
Now, if the default_device is set to the GPU, only an empty image is output by the program. The relevant image setup code is (note that input is a Magick::Image and in_pixels a heap allocated array of unsigned short):
cl::ImageFormat format(CL_RGBA, CL_UNSIGNED_INT16);
cl::Image2D input_image_buffer;
input.write(0, 0,
input.baseColumns(), input.baseRows(), "RGBA", Magick::ShortPixel, in_pixels);
input_image_buffer = cl::Image2D(
context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
format,
input.baseColumns(),
input.baseRows(),
0,
in_pixels,
&cl_error);
cl::Image2D output_image_buffer;
output_image_buffer = cl::Image2D(
context,
CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
format,
input.baseColumns(),
input.baseRows(),
0,
out_pixels,
&cl_error);
And the Kernel setup/Image output code (gaussian_program was built without error of course)
cl::Kernel gaussian_kernel(gaussian_program, "gaussian_blur");
cl::CommandQueue queue(context, default_device, 0, &cl_error);
cl::size_t<3> origin;
cl::size_t<3> size;
origin[0] = 0;
origin[1] = 0;
origin[2] = 0;
size[0] = input.baseColumns();
size[1] = input.baseRows();
size[2] = 1;
cl_error = gaussian_kernel.setArg(0, input_image_buffer);
cl_error = gaussian_kernel.setArg(1, output_image_buffer);
cl::NDRange range(input.baseColumns(), input.baseRows());
cl_error = queue.enqueueNDRangeKernel(
gaussian_kernel,
cl::NullRange,
range,
cl::NullRange);
queue.finish();
try{
output.read(
input.baseColumns(),
input.baseRows(),
"RGBA", Magick::ShortPixel, out_pixels);
}
catch(Magick::Exception& ex) {
std::cerr << "A Magick error occured while writing the pixel cache: " <<
std::endl << ex.what() << std::endl;
return false;
}
Now, I removed a lot of error checking for the purpose of this example, but the original code checks cl_error after every OpenCL call and never signals an error. The code executes as expected on the CPU but the image is empty when the code is executed on the GPU.
I suspected a synchronization issue at first (the queue.finish() call is required for that precise purpose, even on the CPU) but littering the code with cl::finish() or queue.finish() calls in an attempt to serialize the execution didn't help at all.
Is there something I'm clearly doing wrong? Is there a potential reason for this OpenCL kernel to fail on the GPU but not on the CPU?
For the record, I'm on Ubuntu 13.04 using the AMD APP SDK OpenCL implementation with a Radeon HD 7970.
printfcalls to a kernel. So I started debugging using such statements and noticed thefloat -> uintconversions weren't happening as I expected. Right as I was starting to investigate further, I saw your comment and made the fixes. You just put an end to a couple days of misery. If you post this comment as an answer, I'll mark it as accepted. - anthonyvd