Which memory access pattern is more efficient for a cached GPU?

Question

So lets say I have a global array of memory:

|a|b|c| |e|f|g| |i|j|k| |

There are four 'threads' (local work items in OpenCL) accessing this memory, and two possible patterns for this access (columns are time slices, rows are threads):

   0 -> 1 -> 2 -> 3
t1 a -> b -> c -> .
t2 e -> f -> g -> .
t3 i -> j -> k -> .
t4 .    .    . `> .

The above pattern splits the array in to blocks with each thread iterating to and accessing the next element in a block per time slice. I believe this sort of access would work well for CPUs because it maximizes cache locality per thread. Also, loops utilizing this pattern can be easily unrolled by the compiler.

The second pattern:

   0 -> 1 -> 2 -> 3
t1 a -> e -> i -> .
t2 b -> f -> j -> .
t3 c -> g -> k -> .
t4 .    .    . `> .

The above pattern accesses memory in strides: for example, thread 1 accesses a, then e, then i etc. This maximizes cache locality per unit time. Consider you have 64 work-items 'striding' at any given time slice. This means that, with a cache-line size of 64 bytes and elements of sizeof(float), work-items 1-16's read are cached by work-item 1's read. The data width/count per cell (where 'a' is a cell from above) has to be chosen carefully to avoid misaligned access. These loops don't seem to unroll as easily (or at all using Intel's Kernel Builder with the CPU). I believe this pattern would work well with a GPU.

I'm targeting GPUs with cache hierarchies. Specifically AMD's latest architecture (GCN). Is the second access pattern an example of 'coalescing'? Am I wrong in my thought process somewhere?

yes; the second pattern is coalescing, and is generally to be preferred on stream processors, or whatever you want to call them. I havnt worked with AMD hardware myself, but I imagine the same must hold there — Eelco Hoogendoorn
yes, the second one is the proper way to go. All your explanation is fine — DarkZeros
Column vs row access is only for 2D-array global memory accesses as far as I know. The first form will use 1/3 of total mcache lines at a time. The second will use all cache lines. Second works faster for matrix multiplications with blocking algorithms on my HD7870. — huseyin tugrul buyukisik
@huseyintugrulbuyukisik Buffers on the stack should store data the same way regardless of dimension, isn't the dimension just an abstraction? Now 2DImages are a bit different, and can be cached in more than one dimension (say elements above, below and aside instead of just ahead). — bhimberg

Kurt Kurt · Accepted Answer · 2014-01-14T17:16:30

I think the answer depends on whether or not the accesses are to global or local memory. If you are pulling the data from global memory, then you need to worry about coalescing the reads (ie contiguous blocks, second example). However, if you are pulling the data from local memory, then you need to worry about bank conflicts. I have some but not a lot of experience, so I'm not stating this as absolute truth.

Edit: After reading up on GCN, I don't think the caches make a difference here. You can basically think of them as just speeding up global memory if you repeatedly read/write the same elements. On a side note, thanks for asking the question, because reading up on the new architecture is pretty interesting.

Edit 2: Here's a nice Stack Overflow discussion of banks for local and global memory: Why aren't there bank conflicts in global memory for Cuda/OpenCL?

Which memory access pattern is more efficient for a cached GPU?

1 Answers