I'm running a Mandelbrot generator (a 2D image computed from static parameters) in OpenCL. The kernel is straightforward:
__kernel
void mandelbrot(__global uchar *output,
                const float xstep,
                const float xoffset,
                const float ystep,
                const float yoffset,
                const int maxiter)
{
    int gid_y = get_global_id(1);
    int gid_x = get_global_id(0);

    // calculate x and y on the fly for every pixel.
    // This is just as fast as reading precalculated rulers from global memory.
    float x = gid_x * xstep + xoffset;
    float y = gid_y * ystep + yoffset;

    float real = 0;
    float imag = 0;
    int out = 0;

    for (int curiter = 0; curiter < maxiter; curiter++) {
        float nreal = real * real - imag * imag + x;
        imag = 2 * real * imag + y;
        real = nreal;
        // escape test: |z|^2 > 4 means the point diverges
        if (real * real + imag * imag > 4.0f) {
            out = curiter;
            break;
        }
    }

    // normalize the iteration count to a 0..255 grey value
    out *= 256.0 / (float)maxiter;
    output[gid_y * get_global_size(0) + gid_x] = out;
}
[EDIT] I posted the full kernel and swapped rows and columns as suggested. This gained me 18% performance on AMD, but 0% on Nvidia. The original code was
output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;
[/EDIT]
I'm running it on my Nvidia Quadro 1000M, which has 2 compute units and 96 CUDA cores (48 cores per compute unit).
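(The compute-unit count is what clGetDeviceInfo reports; a minimal sketch of the query, assuming the cl_device_id has already been obtained:)

cl_uint compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
/* comes back as 2 on the Quadro 1000M */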
I'm playing around with the local work-group size when enqueuing the kernel. These are the performance results I get with different sizes when generating a roughly 400 MPixel image (a sketch of the enqueue and timing code follows the table). All numbers come from the OpenCL profiler and exclude the final memory copy back to the host. The image is 40992x10272 pixels; both width and height are divisible by 48.
rows x columns: throughput
8x8:   397 MPixel/s
8x12:  505 MPixel/s
8x16:  523 MPixel/s
8x24:  521 MPixel/s
8x32:  520 MPixel/s
8x48:  520 MPixel/s
1x48:  321 MPixel/s
2x32:  424 MPixel/s
2x48:  523 MPixel/s
4x24:  519 MPixel/s
3x32:  525 MPixel/s
4x32:  525 MPixel/s
4x48:  525 MPixel/s
12x8:  490 MPixel/s
12x12: 464 MPixel/s
12x24: 505 MPixel/s
12x32: 508 MPixel/s
12x48: 433 MPixel/s
16x8:  499 MPixel/s
16x12: 499 MPixel/s
16x16: 472 MPixel/s
16x24: 450 MPixel/s
16x32: 440 MPixel/s
16x48: 418 MPixel/s
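For reference, this is roughly how each run is enqueued; the numbers above come from the profiler, but timing the kernel event directly gives the same kind of figure. A minimal sketch with error checking omitted, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and the kernel arguments are already set:

#include <CL/cl.h>

/* Times one kernel run and returns the throughput in MPixel/s.
   'queue' must have CL_QUEUE_PROFILING_ENABLE; 'kernel' already has its args set. */
static double run_once(cl_command_queue queue, cl_kernel kernel,
                       size_t local_cols, size_t local_rows)
{
    size_t global[2] = { 40992, 10272 };           /* dim 0 = columns (x), dim 1 = rows (y) */
    size_t local[2]  = { local_cols, local_rows }; /* e.g. { 32, 4 } for the "4x32" case    */

    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_start, t_end;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof t_start, &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof t_end,   &t_end,   NULL);
    clReleaseEvent(evt);

    double seconds = (double)(t_end - t_start) * 1e-9;   /* timestamps are in nanoseconds */
    return (40992.0 * 10272.0) / (seconds * 1e6);        /* MPixel/s, copy-back excluded  */
}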
Some of these numbers leave me baffled. While it is clear why I get best results with 48 columns (thanks to how SIMD operations work), I don't understand:
- why does performance degrade dramatically when I use 16 rows per group?
- why do I get poor performance with 1x48?
- why in heaven do I get top performance with 3x32, 4x32, and 8x32?!? I would have expected 33% of the SIMD processors to sit idle, yet instead it looks as if a workgroup is being split across the two compute units?!?
- why does PREFERRED_WORK_GROUP_SIZE_MULTIPLE return 32 instead of 48? (the query I use is sketched after this list)
- is there a non-empirical way to figure out the geometry for top performance on any GPU (ATI/Nvidia/Intel HD), given only what I can query from the OpenCL info structures?
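For completeness, this is how I read that preferred multiple (a sketch; the kernel and device handles are assumed to be valid):

size_t preferred = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred), &preferred, NULL);
/* 'preferred' comes back as 32 on the Quadro 1000M, even though 48 columns performs best */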
Thanks in advance