1
votes

I am currently multiplying two byte matrices in an OpenCL kernel, using a block matrix multiplication algorithm: I subdivide the matrices into tiles (32 x 32), load those tiles into local memory, and write the results back to global memory.

Currently, memory access is the bottleneck. I'm trying to see how much I can optimise it.

Let's say that I'm multiplying C = A x B where A,B,C are char*

A(Ndim,Pdim), B(Pdim,Mdim), C(Ndim,Mdim).

I currently have A in row major format and B in column major format to ensure that memory accesses are sequential within a work group for each matrix.

Each work item loads a single byte into local memory and is responsible for processing that byte. The dimensions for my kernel are {Ndim,Mdim} for the global work items and {block_size,block_size} for the local work items.

The code is almost identical to http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf (with the exception that B is stored in column-major format).
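Roughly, the kernel is structured like the following sketch (an illustrative simplification, not my exact code; it assumes block_size = 32, A row-major, B column-major, and dimensions that are multiples of 32):

    #define BLOCK_SIZE 32

    // Tiled byte matrix multiply: C = A x B, with A row-major (Ndim x Pdim),
    // B column-major (Pdim x Mdim), C row-major (Ndim x Mdim).
    // Launched with global size {Ndim, Mdim} and local size {BLOCK_SIZE, BLOCK_SIZE}.
    __kernel void matmul_bytes(__global const char *A,
                               __global const char *B,
                               __global char *C,
                               const int Ndim, const int Mdim, const int Pdim)
    {
        const int row  = get_global_id(0);   // 0 .. Ndim-1
        const int col  = get_global_id(1);   // 0 .. Mdim-1
        const int lrow = get_local_id(0);
        const int lcol = get_local_id(1);

        __local char Asub[BLOCK_SIZE][BLOCK_SIZE];
        __local char Bsub[BLOCK_SIZE][BLOCK_SIZE];

        int acc = 0;
        for (int t = 0; t < Pdim / BLOCK_SIZE; ++t) {
            // Each work item loads one byte of A and one byte of B into local memory.
            Asub[lrow][lcol] = A[row * Pdim + t * BLOCK_SIZE + lcol];   // row-major A
            Bsub[lrow][lcol] = B[col * Pdim + t * BLOCK_SIZE + lrow];   // column-major B
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < BLOCK_SIZE; ++k)
                acc += Asub[lrow][k] * Bsub[k][lcol];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        C[row * Mdim + col] = (char)acc;   // truncated back to a byte
    }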

My question: how can I optimise memory accesses? I hear a lot about coalescing, but I'm struggling to understand what the tradeoff is between coalescing and parallelism.

Option 0: Leave it as it is. Even though each thread accesses a single byte, those accesses get coalesced, so every thread within a work group ends up using data that has already been fetched. -> Unlikely, given that my accesses are not 4-byte aligned; I suspect every fetch ends up loading 4 bytes + x, where x is the offset of the thread.

Option 1: Use integer matrices, reducing parallelism. If I stored the matrices as integers, I would be able to load much more at a time, but it would significantly reduce the parallelism (by a factor of 4), since each of the four byte multiplications would have to be performed sequentially.
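In code, I imagine option 1 looking roughly like this inside each work item (an illustrative sketch only; the packed words would come from single int loads of A and B, and I assume signed bytes and a little-endian byte order):

    // Option 1 sketch: 'a_packed' holds 4 consecutive bytes of a row of A,
    // 'b_packed' 4 consecutive bytes of a column of B, each fetched with one
    // 32-bit load. The four byte products are accumulated serially.
    inline int mul_add_4_bytes(int a_packed, int b_packed, int acc)
    {
        for (int i = 0; i < 4; ++i) {
            char a = (char)((a_packed >> (8 * i)) & 0xFF);   // i-th byte of the A word
            char b = (char)((b_packed >> (8 * i)) & 0xFF);   // i-th byte of the B word
            acc += a * b;                                    // serial per-byte multiply-add
        }
        return acc;
    }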

Option 2: Use integer matrices, but keep the parallelism the same. This basically means that the same data in memory will be requested multiple times, once by each of the work items that need a byte of it. Intuitively, this corresponds to loading int foo at get_global_id(0), and then, converting foo to char foo_bytes[4], taking byte x = foo_bytes[get_local_id(0) % 4]. My understanding is that the first thread will trigger the actual load from memory, while the remaining threads in the work group will see the data already loaded.
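Concretely, I picture something like this (an illustrative sketch; A_words, word_index and lane are made-up names), where four neighbouring work items request the same int and each keeps only its own byte:

    // Option 2 sketch: four neighbouring work items read the same 32-bit word
    // (which the hardware can serve with a single broadcast/transaction) and
    // each extracts the byte matching its lane within the group of four.
    inline char my_byte(__global const int *A_words, int word_index, int lane)
    {
        int w = A_words[word_index];                     // same address for lanes 0..3
        return (char)((w >> (8 * (lane & 3))) & 0xFF);   // keep only this lane's byte
    }

Here lane would be something like get_local_id(0) % 4 and word_index the byte index divided by 4.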

Option 3: Use integer matrices, reducing parallelism, but use vector types within a work item to process the data. I understand that OpenCL supports vector types: if I load a 32-bit integer, I could convert it to a vector type so that the work item processes the 4 bytes in parallel. My understanding is that this is only syntactic and that I wouldn't get any performance improvement from using vector types like that in OpenCL.
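For option 3, I am thinking of something along these lines (again just a sketch):

    // Option 3 sketch: reinterpret the two 32-bit loads as char4, widen to int4
    // to avoid overflow, and let the compiler map the 4 multiplies onto whatever
    // SIMD hardware the device has.
    inline int mul_add_char4(int a_packed, int b_packed, int acc)
    {
        char4 a = as_char4(a_packed);                    // reinterpretation, no conversion
        char4 b = as_char4(b_packed);
        int4  p = convert_int4(a) * convert_int4(b);     // 4 byte products, widened
        return acc + p.x + p.y + p.z + p.w;              // horizontal sum
    }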

From what I understand, option 2 is preferable. Is this correct? And if not, why?


2 Answers

3
votes

Memory coalescing is the single most important performance consideration for programming nVidia GPUs. If thread i is reading from memory location n, then have thread i+1 read from location n+1. If the threads are in the same warp, then these reads are "coalesced" into one transaction.

Notice that, in the nVidia example that loads each submatrix into shared memory, the matrices are both in row-major order. This means that the thread for (row,col) will read memory cell row * stride + col, and the thread for (row,col+1) will read memory cell row * stride + col + 1, which indeed are next to each other in memory. This will be coalesced if the threads are in the same warp -- which is likely, since the threads are ordered in row-major order.

If the matrices are in column-major order, THIS SCREWS EVERYTHING UP! The thread for (row,col+1) will read memory cell (col + 1) * stride + row, which is NOT next to col * stride + row in memory!

Therefore, your little change to column-major order broke the most important thing to optimize for on nVidia GPUs!
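To make the difference concrete, here is a toy kernel contrasting the two patterns (illustrative only, not code from the question; stride is the row length of M in bytes):

    __kernel void access_patterns(__global const char *M, __global char *out,
                                  const int stride, const int row, const int col)
    {
        const int tid = get_global_id(0);

        // Row-major walk: thread tid and thread tid+1 read adjacent bytes,
        // so a warp's reads coalesce into one transaction.
        char coalesced = M[row * stride + col + tid];

        // Column-major walk: thread tid and thread tid+1 are 'stride' bytes
        // apart, so each read becomes its own transaction.
        char scattered = M[(col + tid) * stride + row];

        out[tid] = coalesced + scattered;   // keep both reads live
    }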

2
votes

Option 0 - This isn't so bad if it keeps the code simple and your current performance is good enough.

Option 1 - I think this is worth a try. You want to load 4 bytes as a single int and process them with a single thread. This ALU saturation is exactly what your scheduler needs to hide the global memory latency you are experiencing. I think this is a very close 2nd place to option #2.

Option 2 - Likely the best one that you have mentioned because it will take advantage of memory broadcasting available on many modern devices. Each int value would be read once per 4 threads. I think it is worth testing the performance when processing more than 1 int per 4 threads though (maybe 4 ints per 4 threads, for 16 bytes total).

Option 3 - This seems to be the natural extension to option #1. If you're going to give option 1 a shot, mapping the values to vectors is the next logical thing to test out. Possibly no performance gain for every architecture though -- GPUs love floats, doubles and ints, not necessarily bytes.

More Ideas/comments:

I think the biggest optimization for your global access performance is the column-major ordering you have already implemented.

Have you thought of using half and halfn types? For devices which support half, you should be able to get double the data density over float/floatn. This isn't quite as good as 4 bytes packed as an int or char4, but any device supporting the half type will likely support dot(halfn,halfn), and this could get you computing 4, 8, or 16 MADs at a time.
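For example, something along these lines (an untested sketch, assuming the device exposes cl_khr_fp16):

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    // Widen 4 bytes of A and 4 bytes of B to half4 and let dot() perform the
    // 4 multiply-adds in one call. Note the products are rounded to half
    // precision, so for byte data this trades exactness for throughput.
    inline float mad4_via_half(char4 a, char4 b)
    {
        half4 ha = convert_half4(a);   // char values (-128..127) are exactly representable
        half4 hb = convert_half4(b);
        return (float)dot(ha, hb);     // 4 MADs in one built-in call
    }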

Option 4 - I highly recommend reading much larger blocks into local memory. When you multiply 32x32 matrices from local memory, each element is read 32 times from local memory, but only once from global memory. When you do the same with 64x64 blocks, each element is read 64 times from local memory. Many OpenCL devices have 32KB of local memory, and when you have three 32x32 byte matrices, you only use 3KB.

If you like to use square blocks: 3 * 64x64 bytes = 12KB, 3 * 96x96 bytes = 27KB.

If you prefer to work on a 32x32 block of the output matrix 'C':

blockDim = ((32768 - 32*32) /2 )/32 = 496
1) read 496x32 block from A, store locally
2) read 496x32 block from B, store locally
3) read or initialize 32x32 block of C in local memory
4) do the math
5) write the 32x32 block to global memory C

496 is larger than most work group dimensions allow, but I personally prefer using 32x1 work items and looping through the data anyway.
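Roughly, that scheme could look like the sketch below with 32x1 work items (illustrative only: I keep the 32x32 block of C in private accumulators instead of local memory, and assume A row-major, B column-major, and dimensions that divide evenly):

    #define CBLK   32
    #define PCHUNK 496

    // Launched with global size {Ndim, Mdim/32} and local size {32, 1}:
    // each work group produces one 32x32 block of C, and each work item owns
    // one column of that block.
    __kernel void matmul_blocks(__global const char *A, __global const char *B,
                                __global char *C,
                                const int Ndim, const int Mdim, const int Pdim)
    {
        const int lid    = get_local_id(0);           // 0..31, one per C column
        const int c_row0 = get_group_id(0) * CBLK;    // top row of this C block
        const int c_col0 = get_group_id(1) * CBLK;    // left column of this C block

        __local char Atile[CBLK][PCHUNK];             // ~15.5 KB
        __local char Btile[CBLK][PCHUNK];             // ~15.5 KB, one B column per row

        int acc[CBLK];                                // this work item's column of C
        for (int i = 0; i < CBLK; ++i) acc[i] = 0;

        for (int p0 = 0; p0 < Pdim; p0 += PCHUNK) {
            // 1) + 2): each work item copies one row of the A slab and one
            //          column of the B slab into local memory
            for (int p = 0; p < PCHUNK; ++p) {
                Atile[lid][p] = A[(c_row0 + lid) * Pdim + p0 + p];   // row-major A
                Btile[lid][p] = B[(c_col0 + lid) * Pdim + p0 + p];   // column-major B
            }
            barrier(CLK_LOCAL_MEM_FENCE);

            // 4): every row of the A slab against this work item's B column
            for (int i = 0; i < CBLK; ++i)
                for (int p = 0; p < PCHUNK; ++p)
                    acc[i] += Atile[i][p] * Btile[lid][p];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // 5): write this work item's column of the 32x32 C block (truncated to char)
        for (int i = 0; i < CBLK; ++i)
            C[(c_row0 + i) * Mdim + c_col0 + lid] = (char)acc[i];
    }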