9
votes

I'm running a mandelbrot generator (2D image from static params) on OpenCL. The program is straightforward:

__kernel
void mandelbrot(__global uchar * output, 
                const float xstep,
                const float xoffset,
                const float ystep,
                const float yoffset,
                const int maxiter)
{
    int gid_y = get_global_id(1);
    int gid_x = get_global_id(0);

    //calculate x and y on the fly for every pixel. 
    //This is just as fast as reading precalculated rulers from global memory.
    float x = gid_x * xstep + xoffset;
    float y = gid_y * ystep + yoffset;

    float real = 0;
    float imag = 0;

    int out = 0;

    for(int curiter = 0; curiter < maxiter; curiter++) {
        float nreal = real*real - imag*imag + x;
        imag = 2* real*imag + y;
        real = nreal;

        if (real*real + imag*imag > 4.0f) {
            out = curiter;
            break;
        }
    }

    //normalize output
    out *= 256.0 / (float)maxiter;
    output[gid_y * get_global_size(0) + gid_x] = out;

}

[EDIT] [posted full kernel, and swapped rows and columns as suggested. This way I gained 18% performance on AMD, but 0% on NVidia. The original code was

output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;

[/EDIT]

I'm running it on my Nvidia Quadro 1000M, which has 2 compute units and 96 CUDA cores (48 cores per compute unit).

I'm playing around by changing the local group size when enqueuing the kernel. These are the performance results I get with different sizes when generating a 400Mpixel image. All numbers are from the OpenCL profiler and exclude the final memory copy back to the OS. The image is 40992x10272 - both height and width are divisible by 48.

rows x columns
8x8: 397 MPixel/s
8x12: 505 MPixel/s
8x16: 523 MPixel/s
8x24: 521 MPixel/s
8x32: 520 MPixel/s
8x48: 520 MPixel/s

1x48: 321 MPixel/s
2x32: 424 MPixel/s
2x48: 523 MPixel/s
4x24: 519 MPixel/s
3x32: 525 MPixel/s
4x32: 525 MPixel/s
4x48: 525 MPixel/s

12x8: 490 MPixel/s
12x12:464 MPixel/s
12x24:505 MPixel/s
12x32:508 MPixel/s
12x48:433 MPixel/s

16x8: 499 MPixel/s
16x12:499 MPixel/s
16x16:472 MPixel/s
16x24:450 MPixel/s
16x32:440 MPixel/s
16x48:418 MPixel/s

Some of these numbers leave me baffled. While it is clear why I get best results with 48 columns (thanks to how SIMD operations work), I don't understand:

  1. why does performance degrade dramatically when I use 16 rows per group?
  2. why do I get poor performance with 1x48?
  3. why in heaven do I get top performance with 3x32, 4x32, and 8x32?!? I would have expected 33% of the SIMD processors to sit idle, and instead it looks like a workgroup is sitting in between the two compute units?!?
  4. why does PREFERRED_WORK_GROUP_SIZE_MULTIPLE return 32 instead of 48?
  5. is there a non-empirical way to figure out the geometry for top performance on any GPU (ATI/Nvidia/Intel HD), given only what I acquire from the OpenCL info structures?

Thanks in advance

3
removed cuda tag - question has nothing to do with cudaRobert Crovella
Can you give more code from kernel?huseyin tugrul buyukisik
Why don't you simply let OpenCL handle the work size? It should hit the 525Mp/s in automatic mode (aka: pass a NULL value as local size). Since the automatic mode will select the higher value of threads without hitting the memory limit of the compute unit.DarkZeros
Posted full kernel on request.crusaderky
I actually tried passing NULL to local size. I get 473 MPixel/s - 90% of the performance I get by explicitly stating the local size.crusaderky

3 Answers

20
votes

I answered to a similar question here that you might find interesting before reading the following.

why does performance degrade dramatically when I use 16 rows per group?

Actually it already degrades when you use 12 rows. Memory access works by transaction. A transaction will fetch a certain number of bytes in one shot. Now if several workitems try to access several contiguous elements in an array it means that one transaction might be enough to serve them all.

Because you access the memory in this way:

output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;

it means that the bigger the local size is in the dimension 0, the bigger the number of transaction will be since you have to access non contiguous elements (separated by get_global_size(1) elements). And global memory access is expensive.

So in the case of the 12/16 rows, you have at least 12/16 transactions needed. This lead to your second question:

why do I get poor performance with 1x48?

Based on what I've just said before, it seems that the performance should be great, since the number of transactions would be minimal.

But here comes the problem of idling threads. The information you got regarding the 48 cores per SM is wrong as already pointed out by others. Threads are executed in group (called warp for NVIDIA) of 32 on NVIDIA hardware. Note these groups are called wavefront and can be up to 64 threads for AMD. Since you have in this case a workgroup composed of 48 threads (1 by 48), it means that 64 threads are scheduled. It is always a number of threads multiple of 32 that is scheduled because you can't execute a fraction of a warp.

Therefore in this case you have a fourth of the threads that do nothing. And actually when you compare with the result you obtained for 2x32 (still 64 threads - 2 warps, but fully utilized) 321 MPixel/s is pretty much 3/4 of 424 MPixel/s.

It is worth noting also this result: 2x48: 523 MPixel/s. In this case your workgroup size is 96 a multiple of 32. So no idling threads.

why in heaven do I get top performance with 3x32, 4x32, and 8x32?!?

Well, the answer comes from the two previous ones: You use multiple of 32, and you keep the number of threads in the dimension 0 relatively small. But let's have a closer look to your results:

2x32:  424 MPixel/s
3x32:  525 MPixel/s
4x32:  525 MPixel/s
8x32:  520 MPixel/s
16x32: 440 MPixel/s

The decrease of performance for the two last lines is easily explained with what was said. However, the increase of performance between the first and the second line is not.

The increase of performance comes form somewhere else in this case. The fact that in the second case enough warps run on the same SM to hide the access memory latency. You see the REFERRED_WORK_GROUP_SIZE_MULTIPLE value says only that you should try to use a MULTIPLE of this value for best performance. Several warps can be scheduled on the same SM at the same time.

So, how does it work? Let's take the 3x32 case. You have a workgroup composed of 3 warps. Since they belong to the same workgroup they are scheduled on the same SM as required by the OpenCL standard (if it wasn't the case, sync between threads within a workgroup wouldn't be possible).

The first warp starts to run until it gets stall because a memory access is needed. Meanwhile the warp 1 waits for the memory transactions to complete, the warp 2 can start to run. Since there is a lot of registers on the SM, the SM can easily and quickly switch context to run others warps. All the variables of the warp 1 stay on the registers allocated to the warp 1. Then the warp 2 hits the line where a memory access is required and gets stall. At that moment, the next ready to run warp can start running. It could be the warp 3 but also the warp 1 if its memory access is completed. In your case it seems that it is the warp 3 that runs, since you have a difference between 2x32 and 3x32. In the first case there are not enough warps scheduled to hide the memory accesses though in the second case there are.

As a mater of fact, this influence as well the bad performance for the 1x48 size from the question 2.

why does PREFERRED_WORK_GROUP_SIZE_MULTIPLE return 32 instead of 48?

Already answered.

is there a non-empirical way to figure out the geometry for top performance on any GPU (ATI/Nvidia/Intel HD), given only what I acquire from the OpenCL info structures?

It's like for any other languages. When you know how it works under the hood, it helps you to produce good first code. But you'll still have to benchmark it, and go through a process of trial and errors to tweak it. Keeping in mind what I've just written is only a small part of the things that matter for performance. Querying some info from OpenCL combined with a good understanding of CPU/GPU will obviously help... but that's it.

Because a lot of parameters influencing performance are antagonists, what you'll gain inone side, will be lost in the other.

Therefore keep benchmarking ;).

1
votes

It all depends on the code you are not showing. And that is the key. If your code was very simple ie: out = 8; then your supposition will probably be correct.

However, as you said, the value REFERRED_WORK_GROUP_SIZE_MULTIPLE returns 32. This means, that 32 is the maximum concurrent threads the compute unit can launch in parallel without affecting the performance. For example, there is no sense in launching more than 32. If with 32 you already deplete the local memory storage and you need to recur to global memory (which is dammly slow).

If you try to go over the recomended limit, you obtain exactly that -> A performance decrease. It is not that 32 is better, is the oposite. 48 is bad.

I recomend to you:

  1. Use the automatic size if possible (pass null as local size to the kernel). This leads to the maximum performance if you are not worried about the local worksize shape.
  2. Use the REFERRED_WORK_GROUP_SIZE_MULTIPLE as a reference if you need to set the local size manually.
0
votes

The way your kernel accesses global memory is critical, and determined by the work group and global dimensions:

  • what addresses will be written by consecutive work items in the same work group? Here the stride is get_global_size(1), you may want to swap X and Y. It is generally faster to address consecutive elements in consecutive work items. This is the most important factor.

  • what addresses will be written by consecutive work groups? Consecutive work groups will frequently be scheduled at the same time on different compute units. They may end up competing for the same channel/bank, resulting in loss of performance.

  • it is generally preferable to write 32-bit integers instead of bytes.

To maximize performance, I suggest you introduce more buttons to turn: write kernels computing a block of several pixels (4x2 for example) inside a single work-item, and then benchmark all combinations of (block size) x (work-group size) x (XY swap) x (image size). Then pick the best for your GPU.