4
votes

I´m currently trying to transpose a Matrix in OpenCl with memory coalescing.

I've already tansposed the Matrix in a "simple" way which worked perfectly fine. When I tried to do the same thing now with memory coalescing, i was hoping to see a little improvement in the execution time, but my implementation is actually slower than the simple implementation (The implementation is correct, it's just not efficent). I think I haven't exactly understood how to ensure that the horizontally neighboring work-items write on horizontally neighboring addresses.

Here is the Kernel for my coalisced implementation:

__kernel void MatrixTranspose(__global const float* Matrix, 
__global float* MatrixTransposed, uint Width, uint Height, __local float* block) {

    int2 GlobalID;
    GlobalID.x = get_global_id(0);
    GlobalID.y = get_global_id(1);

    int2 LocalID;
    LocalID.x = get_local_id(0);
    LocalID.y = get_local_id(1);

    block[LocalID.y*get_local_size(0) + LocalID.x] = Matrix[GlobalID.y*Width + GlobalID.x];

    barrier(CLK_LOCAL_MEM_FENCE);

    int2 groupId;
    groupId.x = get_group_id(0);
    groupId.y = get_group_id(1);
    int2 localSize;
    localSize.x = get_local_size(0);
    localSize.y = get_local_size(1);
    MatrixTransposed[Height*(LocalID.x + groupId.x*localSize.x) + Height - (LocalID.y + groupId.y*localSize.y) - 1] = block[LocalID.y*localSize.x + LocalID.x];
}

I hope someone can give me an advice, thank you :)

2

2 Answers

2
votes

Unfortunately, you are going to be bound by your global read and write speed of the device. Normally you transpose the matrix to do some calculation, and that helps hide the latency. You are reading to local memory, waiting for the barrier, and writing black to global in your example. This only adds the extra step and complexity of using local memory.

You should do something with the data while it is in local memory if you want to hide the global memory latency.

If all you want to do is transpose the matrix, simply read from global and write to the target location in global directly. Maybe look into async_work_group_copy if you still want to try using local memory.

Now for my answer.

Try making a work item responsible for more than a single float. If you read a 4x4 region with a work item, you can transpose it in private memory. This would not only skip local memory, but eliminate the need for a barrier, and reduce the number of work items you need by a factor of 16.

steps:

  • calculate src and dest global memory addresses
  • load four float4 values from global
  • transpose the 4x4 floats by swapping their w,x,y,z values accordingly
  • store 4 float4 values at new location in global memory
  • handle the edge regions of the matrix in a separate kernel, or in the host program for matrices with non-multiple-of-four dimensions (or pad your input matrix to make it multiple of 4)
1
votes

Read columns from original matrix, write them as rows in local memory to avoid memory bank conflicts, and then store rows from local memory into transposed matrix.

In this case two write operations are easy to coalesce (neighbour Work Items write to neighbour memory cells). Read operations are not so good, however.

BTW, what is your Device? If it's good with vector operations, use vload/vstore operations, it may improve IO performance significantly.