I'm currently trying to transpose a matrix in OpenCL with memory coalescing.
I've already transposed the matrix in a "simple" way, which worked perfectly fine. When I tried to do the same thing with memory coalescing, I was hoping to see a small improvement in execution time, but my implementation is actually slower than the simple one (the implementation is correct, it's just not efficient). I think I haven't fully understood how to ensure that horizontally neighboring work-items write to horizontally neighboring addresses.
Here is the kernel for my coalesced implementation:
__kernel void MatrixTranspose(__global const float* Matrix,
                              __global float* MatrixTransposed,
                              uint Width, uint Height,
                              __local float* block) {
    int2 GlobalID;
    GlobalID.x = get_global_id(0);
    GlobalID.y = get_global_id(1);

    int2 LocalID;
    LocalID.x = get_local_id(0);
    LocalID.y = get_local_id(1);

    // Stage one element per work-item in local memory.
    // Neighboring work-items in x read neighboring global addresses here.
    block[LocalID.y*get_local_size(0) + LocalID.x] = Matrix[GlobalID.y*Width + GlobalID.x];

    // Wait until the whole work-group has filled the local block.
    barrier(CLK_LOCAL_MEM_FENCE);

    int2 groupId;
    groupId.x = get_group_id(0);
    groupId.y = get_group_id(1);

    int2 localSize;
    localSize.x = get_local_size(0);
    localSize.y = get_local_size(1);

    // Write the staged element to its position in the output matrix.
    MatrixTransposed[Height*(LocalID.x + groupId.x*localSize.x) + Height - (LocalID.y + groupId.y*localSize.y) - 1] = block[LocalID.y*localSize.x + LocalID.x];
}
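For reference, the write pattern asked about above (neighboring work-items in x writing to neighboring output addresses) can be emulated on the CPU. The following is only a minimal sketch in plain C, not an OpenCL kernel and not a drop-in fix. The names `TILE`, `transpose_group` and `transpose_tiled` are made up for illustration. It assumes a square work-group and a plain transpose, i.e. without the `Height - ... - 1` term used in the kernel above. The `+ 1` padding on the tile is a common trick against local-memory bank conflicts and is also an assumption here.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4 /* hypothetical work-group edge length (square group assumed) */

/* Emulates ONE work-group (gx, gy) of a tiled transpose.
 * in  is height x width, row-major; out is width x height, row-major. */
static void transpose_group(const float *in, float *out,
                            size_t width, size_t height,
                            size_t gx, size_t gy)
{
    float block[TILE][TILE + 1]; /* stands in for __local memory, padded by one column */

    /* Phase 1: work-item (lx, ly) reads in[y][x].
     * Consecutive lx values touch consecutive input addresses. */
    for (size_t ly = 0; ly < TILE; ++ly) {
        for (size_t lx = 0; lx < TILE; ++lx) {
            size_t x = gx * TILE + lx;
            size_t y = gy * TILE + ly;
            if (x < width && y < height)
                block[ly][lx] = in[y * width + x];
        }
    }

    /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) would sit here. */

    /* Phase 2: the GROUP indices are swapped for the output tile, but lx still
     * selects the fastest-varying output index. The transposition happens in
     * the tile lookup (block[lx][ly]), not in the global address. */
    for (size_t ly = 0; ly < TILE; ++ly) {
        for (size_t lx = 0; lx < TILE; ++lx) {
            size_t ox = gy * TILE + lx; /* output column, range 0..height-1 */
            size_t oy = gx * TILE + ly; /* output row,    range 0..width-1  */
            if (ox < height && oy < width)
                out[oy * height + ox] = block[lx][ly];
        }
    }
}

/* Runs every work-group in turn; out-of-place only. */
static void transpose_tiled(const float *in, float *out,
                            size_t width, size_t height)
{
    assert(in != out);
    size_t groups_x = (width + TILE - 1) / TILE;
    size_t groups_y = (height + TILE - 1) / TILE;
    for (size_t gy = 0; gy < groups_y; ++gy)
        for (size_t gx = 0; gx < groups_x; ++gx)
            transpose_group(in, out, width, height, gx, gy);
}
```

In an actual kernel the two loop nests would collapse into the per-work-item code before and after the barrier, with `lx`/`ly` coming from `get_local_id` and `gx`/`gy` from `get_group_id`. The point of the sketch is only that in phase 2 an increment of `lx` moves the output address by one element, whereas in the kernel above an increment of `LocalID.x` moves it by `Height` elements.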
I hope someone can give me some advice. Thank you :)