So lets say I have a global array of memory:
|a|b|c| |e|f|g| |i|j|k| |
There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):
0 -> 1 -> 2 -> 3
t1 a -> b -> c -> .
t2 e -> f -> g -> .
t3 i -> j -> k -> .
t4 . . . `> .
The above pattern splits the array in to blocks with each thread iterating to and accessing the next element in a block per time slice. I believe this sort of access would work well for CPUs because it maximizes cache locality per thread. Also, loops utilizing this pattern can be easily unrolled by the compiler.
The second pattern:
0 -> 1 -> 2 -> 3
t1 a -> e -> i -> .
t2 b -> f -> j -> .
t3 c -> g -> k -> .
t4 . . . `> .
The above pattern accesses memory in strides: for example, thread 1 accesses a, then e, then i etc. This maximizes cache locality per unit time. Consider you have 64 work-items 'striding' at any given time slice. This means that, with a cache-line size of 64 bytes and elements of sizeof(float), work-items 1-16's read are cached by work-item 1's read. The data width/count per cell (where 'a' is a cell from above) has to be chosen carefully to avoid misaligned access. These loops don't seem to unroll as easily (or at all using Intel's Kernel Builder with the CPU). I believe this pattern would work well with a GPU.
I'm targeting GPUs with cache hierarchies. Specifically AMD's latest architecture (GCN). Is the second access pattern an example of 'coalescing'? Am I wrong in my thought process somewhere?