Why OpenCL work group size has huge performance impact on GPU?

Question

I am benchmarking a simple matrix transposition kernel on Qualcomm Adreno 630 GPU, and I am trying to see the impact of different work group size, but surprisingly, I get some interesting result which I cannot explain. Here is my kernel code:

__kernel void transpose(__global float *input, __global float *output, const int width, const int height)
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*height + j] = input[j*width + i];
}

and the width and height are both 6400, the experiment results are(execution time is the difference between END and START event):

work group size      execution time
x     y
4    64              24ms
64   4               169ms
256  1               654ms
1    256             34ms
8    32              27ms
1    1024            375ms
1024 1               657ms
32   32              26ms

after this I did another experimemnt where I change the width and height from 6400 to 6401(and the global work size in NDRangeKernel call as well), and the result is even more interesing:

work group size      execution time
x     y
4    64              28ms
64   4               105ms
256  1               359ms
1    256             31ms
8    32              32ms
1    1024            99ms
1024 1               358ms
32   32              32ms

execution time of most scenarios drops significantly. I know memory coalescing or cache could play a role here, but I cannot completely explain this.

ProjectPhysX ProjectPhysX · Accepted Answer · 2020-08-07T16:05:47

Memory coalescence occurs when consecutive threads access data at consecutive global memory addresses within a 128-byte aligned segment. Then memory accesses are coalesced into one, significantly reducing overall latency.

In the 2D range, coalescing only happens along get_global_id(1) or the j direction in your case. In the line output[i*height + j] = input[j*width + i];, input[j*width + i]; is a misaligned (non-coalesced) read and output[i*height + j] is a coalesced write. Coalesced memory access generally is much faster than misaligned access, but the performance penalty for coalesced/misaligned reads can be vastly different than coalesced/misaligned writes. On most desktop GPU architectures, the combination misaligned read and coalesced write is faster than the other way around, see the diagram below. So your implementation should be the faster variant already.

Since coalesced access is only possible along the j index, if you have a range of (x=256,y=1) (i along x-direction, j along y-direction), you do not get any coalescing. For (x=8,y=32), j is coalesced in groups of 32 8 times per thread block, so memory bandwidth is fairly saturated and performance is good.

If you want maximum possible performance, I'd suggest you go with 1D indexing. This way you have full control about coalescing and coalescing happens over the entire thread block. Your matrix transpose kernel then would look like this:

#define width 6400
__kernel void transpose(__global float *input, __global float *output) {
    const int n = get_global_id(0);
    int i = n/width;
    int j = n%width;
    output[i*height + j] = input[j*width + i];
}

You can bake width into the OpenCL Ccode at C++ runtime and before OpenCL compile time via string concatenation.

Why OpenCL work group size has huge performance impact on GPU?

1 Answers