1
votes

I am currently multiplying two byte matrices in an OpenCL kernel, using a block matrix multiplication algorithm: I subdivide the matrices into tiles (32 x 32), load those tiles into local memory, and write the results back to global memory.

Currently, memory access is the bottleneck. I'm trying to see how much I can optimise it.

Let's say that I'm multiplying C = A x B where A,B,C are char*

A(Ndim,Pdim), B(Pdim,Mdim), C(Ndim,Mdim).

I currently have A in row major format and B in column major format to ensure that memory accesses are sequential within a work group for each matrix.

Each work item loads a single byte into local memory and is responsible for processing that byte. The dimensions for my kernel are {Ndim,Mdim} for the global work items and {block_size,block_size} for the local work items.

The code is almost identical to http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf (with the exception that B is stored in column-major format).
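Roughly, the kernel is structured like the following sketch (an illustrative simplification, not my exact code; it assumes block_size = 32, A row-major, B column-major, and dimensions that are multiples of 32):

    #define BLOCK_SIZE 32

    // Tiled byte matrix multiply: C = A x B, with A row-major (Ndim x Pdim),
    // B column-major (Pdim x Mdim), C row-major (Ndim x Mdim).
    // Launched with global size {Ndim, Mdim} and local size {BLOCK_SIZE, BLOCK_SIZE}.
    __kernel void matmul_bytes(__global const char *A,
                               __global const char *B,
                               __global char *C,
                               const int Ndim, const int Mdim, const int Pdim)
    {
        const int row  = get_global_id(0);   // 0 .. Ndim-1
        const int col  = get_global_id(1);   // 0 .. Mdim-1
        const int lrow = get_local_id(0);
        const int lcol = get_local_id(1);

        __local char Asub[BLOCK_SIZE][BLOCK_SIZE];
        __local char Bsub[BLOCK_SIZE][BLOCK_SIZE];

        int acc = 0;
        for (int t = 0; t < Pdim / BLOCK_SIZE; ++t) {
            // Each work item loads one byte of A and one byte of B into local memory.
            Asub[lrow][lcol] = A[row * Pdim + t * BLOCK_SIZE + lcol];   // row-major A
            Bsub[lrow][lcol] = B[col * Pdim + t * BLOCK_SIZE + lrow];   // column-major B
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < BLOCK_SIZE; ++k)
                acc += Asub[lrow][k] * Bsub[k][lcol];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        C[row * Mdim + col] = (char)acc;   // truncated back to a byte
    }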

My question: how can I optimise memory accesses? I hear a lot about coalescing, but I'm struggling to understand what the tradeoff is between coalescing and parallelism.

Option 0: Leave it as it is. Even though each thread accesses a single byte, those accesses get coalesced, so every thread within a work group ends up using data that has already been fetched. -> Unlikely, given that my accesses are not 4-byte aligned; I suspect every fetch ends up loading 4 bytes + x, where x is the offset of the thread.

Option 1: Use integer matrices, reducing parallelism. If I stored the matrices as integers, I would be able to load much more at a time, but it would significantly reduce the parallelism (by a factor of 4), since each of the four byte multiplications would have to be performed sequentially.
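In code, I imagine option 1 looking roughly like this inside each work item (an illustrative sketch only; the packed words would come from single int loads of A and B, and I assume signed bytes and a little-endian byte order):

    // Option 1 sketch: 'a_packed' holds 4 consecutive bytes of a row of A,
    // 'b_packed' 4 consecutive bytes of a column of B, each fetched with one
    // 32-bit load. The four byte products are accumulated serially.
    inline int mul_add_4_bytes(int a_packed, int b_packed, int acc)
    {
        for (int i = 0; i < 4; ++i) {
            char a = (char)((a_packed >> (8 * i)) & 0xFF);   // i-th byte of the A word
            char b = (char)((b_packed >> (8 * i)) & 0xFF);   // i-th byte of the B word
            acc += a * b;                                    // serial per-byte multiply-add
        }
        return acc;
    }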

Option 2: Use integer matrices, but keep the parallelism the same. This basically means that the same data in memory will be requested multiple times, once by each of the work items that need a byte of it. Intuitively, this corresponds to loading int foo at get_global_id(0), and then, converting foo to char foo_bytes[4], taking byte x = foo_bytes[get_local_id(0) % 4]. My understanding is that the first thread will trigger the actual load from memory, while the remaining threads in the work group will see the data already loaded.
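Concretely, I picture something like this (an illustrative sketch; A_words, word_index and lane are made-up names), where four neighbouring work items request the same int and each keeps only its own byte:

    // Option 2 sketch: four neighbouring work items read the same 32-bit word
    // (which the hardware can serve with a single broadcast/transaction) and
    // each extracts the byte matching its lane within the group of four.
    inline char my_byte(__global const int *A_words, int word_index, int lane)
    {
        int w = A_words[word_index];                     // same address for lanes 0..3
        return (char)((w >> (8 * (lane & 3))) & 0xFF);   // keep only this lane's byte
    }

Here lane would be something like get_local_id(0) % 4 and word_index the byte index divided by 4.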

Option 3: Use integer matrices, reducing parallelism, but use vector types within a work item to process the data. I understand that OpenCL supports vector types: if I load a 32-bit integer, I could convert it to a vector type so that the work item processes the 4 bytes in parallel. My understanding is that this is only syntactic and that I wouldn't get any performance improvement from using vector types like that in OpenCL.
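For option 3, I am thinking of something along these lines (again just a sketch):

    // Option 3 sketch: reinterpret the two 32-bit loads as char4, widen to int4
    // to avoid overflow, and let the compiler map the 4 multiplies onto whatever
    // SIMD hardware the device has.
    inline int mul_add_char4(int a_packed, int b_packed, int acc)
    {
        char4 a = as_char4(a_packed);                    // reinterpretation, no conversion
        char4 b = as_char4(b_packed);
        int4  p = convert_int4(a) * convert_int4(b);     // 4 byte products, widened
        return acc + p.x + p.y + p.z + p.w;              // horizontal sum
    }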

From what I understand, option 2 is preferable. Is this correct? And if not, why?


2 Answers

3
votes

Memory coalescing is the single most important performance consideration for programming nVidia GPUs. If thread i is reading from memory location n, then have thread i+1 read from location n+1. If the threads are in the same warp, then these reads are "coalesced" into one transaction.

Notice that, in the nVidia example that loads each submatrix into shared memory, the matrices are both in row-major order. This means that the thread for (row,col) will read memory cell row * stride + col, and the thread for (row,col+1) will read memory cell row * stride + col + 1, which indeed are next to each other in memory. This will be coalesced if the threads are in the same warp -- which is likely, since the threads are ordered in row-major order.

If the matrices are in column-major order, THIS SCREWS EVERYTHING UP! The thread for (row,col+1) will read memory cell (col + 1) * stride + row, which is NOT next to col * stride + row in memory!

Therefore, your little change to column-major order broke the most important thing to optimize for on nVidia GPUs!
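To make the difference concrete, here is a toy kernel contrasting the two patterns (illustrative only, not code from the question; stride is the row length of M in bytes):

    __kernel void access_patterns(__global const char *M, __global char *out,
                                  const int stride, const int row, const int col)
    {
        const int tid = get_global_id(0);

        // Row-major walk: thread tid and thread tid+1 read adjacent bytes,
        // so a warp's reads coalesce into one transaction.
        char coalesced = M[row * stride + col + tid];

        // Column-major walk: thread tid and thread tid+1 are 'stride' bytes
        // apart, so each read becomes its own transaction.
        char scattered = M[(col + tid) * stride + row];

        out[tid] = coalesced + scattered;   // keep both reads live
    }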

2
votes

Option 0 - This isn't so bad if it keeps the code simple and your current performance is good enough.

Option 1 - I think this is worth a try. You want to load 4 bytes as a single int and process them with a single thread. This ALU saturation is exactly what your scheduler needs to hide the global memory latency you are experiencing. I think this is a very close 2nd place to option #2.

Option 2 - Likely the best one that you have mentioned because it will take advantage of memory broadcasting available on many modern devices. Each int value would be read once per 4 threads. I think it is worth testing the performance when processing more than 1 int per 4 threads though (maybe 4 ints per 4 threads, for 16 bytes total).

Option 3 - This seems to be the natural extension to option #1. If you're going to give option 1 a shot, mapping the values to vectors is the next logical thing to test out. Possibly no performance gain for every architecture though -- GPUs love floats, doubles and ints, not necessarily bytes.

More Ideas/comments:

I think the biggest optimization for your global access performance is the column-major ordering you have already implemented.

Have you thought of using half and halfn types? For devices which support half, you should be able to get double the data density over float/floatn. This isn't quite as good as 4 bytes packed as an int or char4, but any device supporting the half type will likely support dot(halfn,halfn), and this could get you computing 4, 8, or 16 MADs at a time.
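For example, something along these lines (an untested sketch, assuming the device exposes cl_khr_fp16):

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    // Widen 4 bytes of A and 4 bytes of B to half4 and let dot() perform the
    // 4 multiply-adds in one call. Note the products are rounded to half
    // precision, so for byte data this trades exactness for throughput.
    inline float mad4_via_half(char4 a, char4 b)
    {
        half4 ha = convert_half4(a);   // char values (-128..127) are exactly representable
        half4 hb = convert_half4(b);
        return (float)dot(ha, hb);     // 4 MADs in one built-in call
    }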

Option 4 - I highly recommend reading much larger blocks into local memory. When you multiply 32x32 matrices from local memory, each element is read 32 times from local memory, but only once from global memory. When you do the same with 64x64 blocks, each element is read 64 times from local memory. Many OpenCL devices have 32KB of local memory, and when you have three 32x32 byte matrices, you only use 3KB.

If you like to use square blocks: 3 * 64x64 bytes = 12KB, 3 * 96x96 bytes = 27KB.

If you prefer to work on a 32x32 block of the output matrix 'C':

blockDim = ((32768 - 32*32) /2 )/32 = 496
1) read 496x32 block from A, store locally
2) read 496x32 block from B, store locally
3) read or initialize 32x32 block of C in local memory
4) do the math
5) write the 32x32 block to global memory C

496 is larger than most work group dimensions allow, but I personally prefer using 32x1 work items and looping through the data anyway.
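Roughly, that scheme could look like the sketch below with 32x1 work items (illustrative only: I keep the 32x32 block of C in private accumulators instead of local memory, and assume A row-major, B column-major, and dimensions that divide evenly):

    #define CBLK   32
    #define PCHUNK 496

    // Launched with global size {Ndim, Mdim/32} and local size {32, 1}:
    // each work group produces one 32x32 block of C, and each work item owns
    // one column of that block.
    __kernel void matmul_blocks(__global const char *A, __global const char *B,
                                __global char *C,
                                const int Ndim, const int Mdim, const int Pdim)
    {
        const int lid    = get_local_id(0);           // 0..31, one per C column
        const int c_row0 = get_group_id(0) * CBLK;    // top row of this C block
        const int c_col0 = get_group_id(1) * CBLK;    // left column of this C block

        __local char Atile[CBLK][PCHUNK];             // ~15.5 KB
        __local char Btile[CBLK][PCHUNK];             // ~15.5 KB, one B column per row

        int acc[CBLK];                                // this work item's column of C
        for (int i = 0; i < CBLK; ++i) acc[i] = 0;

        for (int p0 = 0; p0 < Pdim; p0 += PCHUNK) {
            // 1) + 2): each work item copies one row of the A slab and one
            //          column of the B slab into local memory
            for (int p = 0; p < PCHUNK; ++p) {
                Atile[lid][p] = A[(c_row0 + lid) * Pdim + p0 + p];   // row-major A
                Btile[lid][p] = B[(c_col0 + lid) * Pdim + p0 + p];   // column-major B
            }
            barrier(CLK_LOCAL_MEM_FENCE);

            // 4): every row of the A slab against this work item's B column
            for (int i = 0; i < CBLK; ++i)
                for (int p = 0; p < PCHUNK; ++p)
                    acc[i] += Atile[i][p] * Btile[lid][p];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // 5): write this work item's column of the 32x32 C block (truncated to char)
        for (int i = 0; i < CBLK; ++i)
            C[(c_row0 + i) * Mdim + c_col0 + lid] = (char)acc[i];
    }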