I have an AMD GPU and I want to implement a 'Matrix Transpose' example. Consider two scenarios for the implementation:
1)
Read from global memory (current position)
Write to global memory (target position)
2)
Read from global memory (current position)
Write to local memory
Read from local memory
Write to global memory (target position)
Assume that I've picked the best work-group size for both solutions. Note that the second algorithm takes advantage of collaborative writes to local memory.
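To make the two scenarios concrete, here is a minimal CPU-side sketch of the index arithmetic I have in mind (this is my own illustration, not my actual kernel code: the `TILE` size, function names, and the assumption that the matrix dimensions are multiples of `TILE` are all mine; on the GPU each loop iteration would be one work-item, and `tile` would live in `__local` memory):

```c
#include <assert.h>

#define TILE 16  /* stands in for the work-group tile size (my choice) */

/* Scenario 1: direct transpose. Each element is read from its current
   position and written straight to its target position. One of the two
   global accesses is necessarily strided by a whole row. */
static void transpose_direct(const float *in, float *out, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[x * h + y] = in[y * w + x];  /* write is strided by h */
}

/* Scenario 2: staged transpose. A work-group cooperatively copies a
   TILE x TILE block into local memory row by row, then writes it back
   to the transposed block position, again row by row; the transpose
   itself happens while reading the tile. Both global reads and global
   writes stay contiguous; only the local accesses are strided.
   Assumes w and h are multiples of TILE. */
static void transpose_tiled(const float *in, float *out, int w, int h)
{
    float tile[TILE][TILE];  /* models __local memory of one work-group */
    for (int by = 0; by < h; by += TILE)
        for (int bx = 0; bx < w; bx += TILE) {
            /* collaborative read: contiguous rows of the input block */
            for (int y = 0; y < TILE; ++y)
                for (int x = 0; x < TILE; ++x)
                    tile[y][x] = in[(by + y) * w + (bx + x)];
            /* collaborative write: contiguous rows of the output block,
               reading the tile transposed */
            for (int y = 0; y < TILE; ++y)
                for (int x = 0; x < TILE; ++x)
                    out[(bx + y) * h + (by + x)] = tile[x][y];
        }
}
```

Both functions produce the same transposed matrix; only the order in which memory is touched differs.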
Surprisingly, the second scenario turns out to be twice as fast as the first, and I just can't understand why.
As I see it, the first scenario performs one read from and one write to global memory per element, while the second performs those same global-memory operations plus one read from and one write to local memory. How can doing strictly more memory operations be faster?
I'd be happy if anyone could help me with this.
Thanks in advance :-)