I have an AMD GPU and I want to implement a 'Matrix Transpose' example. Consider two scenarios for the implementation:
1)
Read from global memory (current position)
Write to global memory (target position)
2)
Read from global memory (current position)
Write to local memory
Read from local memory
Write to global memory (target position)
Assume that I've picked the best work-group size for both solutions. Note that the second algorithm takes advantage of collaborative writes to local memory.
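To make the two scenarios concrete, here is a minimal CPU-side sketch of the index arithmetic I have in mind (this is my own illustration, not my actual kernel code: the `TILE` size, function names, and the assumption that the matrix dimensions are multiples of `TILE` are all mine; on the GPU each loop iteration would be one work-item, and `tile` would live in `__local` memory):

```c
#include <assert.h>

#define TILE 16  /* stands in for the work-group tile size (my choice) */

/* Scenario 1: direct transpose. Each element is read from its current
   position and written straight to its target position. One of the two
   global accesses is necessarily strided by a whole row. */
static void transpose_direct(const float *in, float *out, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[x * h + y] = in[y * w + x];  /* write is strided by h */
}

/* Scenario 2: staged transpose. A work-group cooperatively copies a
   TILE x TILE block into local memory row by row, then writes it back
   to the transposed block position, again row by row; the transpose
   itself happens while reading the tile. Both global reads and global
   writes stay contiguous; only the local accesses are strided.
   Assumes w and h are multiples of TILE. */
static void transpose_tiled(const float *in, float *out, int w, int h)
{
    float tile[TILE][TILE];  /* models __local memory of one work-group */
    for (int by = 0; by < h; by += TILE)
        for (int bx = 0; bx < w; bx += TILE) {
            /* collaborative read: contiguous rows of the input block */
            for (int y = 0; y < TILE; ++y)
                for (int x = 0; x < TILE; ++x)
                    tile[y][x] = in[(by + y) * w + (bx + x)];
            /* collaborative write: contiguous rows of the output block,
               reading the tile transposed */
            for (int y = 0; y < TILE; ++y)
                for (int x = 0; x < TILE; ++x)
                    out[(bx + y) * h + (by + x)] = tile[x][y];
        }
}
```

Both functions produce the same transposed matrix; only the order in which memory is touched differs.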
Surprisingly, the second scenario turns out to be twice as fast as the first, and I just can't understand why.
As I see it, the first scenario performs one read from and one write to global memory per element, while the second performs those same global-memory operations plus one read from and one write to local memory. How can doing strictly more memory operations be faster?
I'd be happy if anyone could help me with this.
Thanks in advance :-)