I am currently multiplying two byte matrices in an OpenCL kernel, using a block matrix multiplication algorithm: I subdivide the matrices into 32 x 32 tiles, load those tiles into local memory, compute the partial products there, and write the results back to global memory.
Currently, memory access is the bottleneck. I'm trying to see how much I can optimise it.
Let's say that I'm multiplying C = A x B, where A, B, C are char*, with dimensions A(Ndim, Pdim), B(Pdim, Mdim), C(Ndim, Mdim).
I currently have A in row-major format and B in column-major format, so that memory accesses within a work group are sequential for each matrix.
Each work item loads a single byte into local memory and is responsible for processing that byte. The dimensions of my kernel are {Ndim, Mdim} for the global work items and {block_size, block_size} for the local work items.
The code is almost identical to the example in http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf (with the exception of the storage layouts described above).
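For concreteness, here is a minimal sketch of the kernel structure I'm describing. This is my reconstruction rather than the exact code (BLOCK, Atile, Btile and the {Mdim, Ndim} launch order are names/choices of mine); it assumes all dimensions are multiples of the tile size and accumulates in a uint before truncating back to a byte:

    #define BLOCK 32

    __kernel void matmul_bytes(const int Ndim, const int Mdim, const int Pdim,
                               __global const uchar *A,   /* row-major,    Ndim x Pdim */
                               __global const uchar *B,   /* column-major, Pdim x Mdim */
                               __global uchar *C)         /* row-major,    Ndim x Mdim */
    {
        const int lx = get_local_id(0);        /* 0..BLOCK-1 */
        const int ly = get_local_id(1);
        const int gx = get_global_id(0);       /* column of C */
        const int gy = get_global_id(1);       /* row of C    */
        const int colBase = gx - lx;           /* first column owned by this group */

        __local uchar Atile[BLOCK][BLOCK];
        __local uchar Btile[BLOCK][BLOCK];

        uint acc = 0;
        for (int t = 0; t < Pdim; t += BLOCK) {
            /* Cooperative tile fill: one byte per work item. Consecutive lx
               touches consecutive addresses in both matrices, because the
               k index is contiguous in row-major A and column-major B. */
            Atile[ly][lx] = A[gy * Pdim + t + lx];
            Btile[ly][lx] = B[(colBase + ly) * Pdim + t + lx];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < BLOCK; ++k)
                acc += (uint)Atile[ly][k] * (uint)Btile[lx][k];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[gy * Mdim + gx] = (uchar)acc;        /* truncation semantics assumed */
    }

With these layouts, consecutive work items in dimension 0 read consecutive bytes of both A and B during the tile fill, which is the access pattern the rest of the question is about.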
My question: how can I optimise memory accesses? I hear a lot about coalescing, but I'm struggling to understand what the tradeoff is between coalescing and parallelism.
Option 0: Leave it as it is. Even though each thread accesses only a byte, those accesses should be coalesced, so every thread in the work group gets data from a transaction that has already been issued. -> Unlikely, given that my accesses are not word-aligned. I suspect I end up loading 4 bytes + x every time, where x is the offset of the thread.
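In terms of the sketch above, the access in question is the one-byte tile fill, and the alignment worry looks like this (transaction sizes vary by device, so take the numbers as illustrative):

    /* 32 consecutive byte addresses, one per work item in a row: */
    Atile[ly][lx] = A[gy * Pdim + t + lx];
    /* If (gy * Pdim + t) is aligned to the memory transaction size,
       these 32 loads merge into a single transaction. If Pdim is not
       a multiple of that size, most rows start mid-segment and the
       hardware has to issue two transactions instead of one. */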
Option 1: Using integer matrices, reducing parallelism. If I were to store the matrices as integers, I would be able to load much more at a time, but it would significantly reduce the parallelism (by a factor of 4), since each group of four byte multiplications would have to be performed sequentially within one work item, as in the sketch below.
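Roughly what I have in mind (a hypothetical variant, not my current kernel; it assumes Pdim is a multiple of 4 and that the buffers are sufficiently aligned to be reinterpreted as uint, which cl_mem allocations normally are):

    __global const uint *A4 = (__global const uint *)A;  /* row-major    */
    __global const uint *B4 = (__global const uint *)B;  /* column-major */

    uint acc = 0;
    for (int k4 = 0; k4 < Pdim / 4; ++k4) {
        uint a = A4[gy * (Pdim / 4) + k4];   /* 4 bytes of row gy of A    */
        uint b = B4[gx * (Pdim / 4) + k4];   /* 4 bytes of column gx of B */
        /* each load now covers 4 values of k, and the corresponding
           4 byte products are handled sequentially in this work item */
        for (int i = 0; i < 4; ++i)
            acc += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
    }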
Option 2: Using integer matrices but keeping the parallelism the same. This basically means that each word of data will be loaded multiple times, once by each of the four work items that share it. Intuitively, this corresponds to loading int foo = A_int[get_global_id(0) / 4], then converting foo to char foo_bytes[4] and taking byte x = foo_bytes[get_global_id(0) % 4]. My understanding is that the first thread's access will actually fetch the data from memory, whilst the remaining threads in the work group will see it already loaded.
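A sketch of that pattern (my paraphrase of the foo/foo_bytes idea; the division/modulo arithmetic and the little-endian byte order are my assumptions):

    /* Four adjacent work items map to the same 32-bit word. All four
       issue the load, but they fall into the same coalesced transaction
       / cache line, so the word should travel from global memory once. */
    __global const uint *A4 = (__global const uint *)A;
    uint foo = A4[get_global_id(0) / 4];
    __private uchar *foo_bytes = (__private uchar *)&foo;  /* view word as 4 bytes    */
    uchar x = foo_bytes[get_global_id(0) % 4];             /* assumes little-endian   */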
Option 3: Using integer matrices, reducing parallelism, but using vector types within a work item to process the data. I understand that OpenCL supports vector types: if I load a 32-bit integer, I could convert it to a vector type so that the work item would process the 4 bytes in parallel, as sketched below. My understanding is that this is only syntactic, and that I wouldn't get any performance improvement from using vector types like that in OpenCL.
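This is the kind of code I mean; whether it maps to real SIMD hardware or stays purely syntactic is exactly my doubt, and presumably device-dependent. The widening to ushort4 (so the byte products don't overflow 8 bits) and the loop bounds are mine:

    uint acc = 0;
    for (int k4 = 0; k4 < Pdim / 4; ++k4) {
        /* vload4 reads 4 consecutive uchars starting at p + 4*offset */
        uchar4 a = vload4(k4, A + gy * Pdim);   /* 4 bytes of row gy of A    */
        uchar4 b = vload4(k4, B + gx * Pdim);   /* 4 bytes of column gx of B */
        ushort4 p = convert_ushort4(a) * convert_ushort4(b);
        acc += (uint)p.x + p.y + p.z + p.w;     /* horizontal add */
    }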
From what I understand, option 2 is preferable. Is this correct? And if not, why?