I'm new to OpenCL and am trying to understand how to optimise matrix multiplication in order to become familiar with the various paradigms. Here's the current code. To multiply matrices A and B, I first copy a row of A into private memory (because each work-item uses its own row), and then a column of B into local memory (because every work-item in the work-group uses it).

1) The code is currently incorrect. I'm struggling to work out how to use the local work-item IDs correctly, and I can't find my mistake. I'm basing it on http://www.cs.bris.ac.uk/home/simonm/workshops/OpenCL_lecture3.pdf, but it seems (slide 27) that the slides are wrong as well, since they don't make use of loc_size in their inner loop.

2) Are there any other optimisations you would suggest with this code?

 __kernel void mmul(
  __global int* C,
 __global int* A, 
 __global int* B,
   const int rA, 
   const int rB,
   const int cC, 
   __local char* local_mem) 
{ 
   int k,ty; 
   int tx = get_global_id(0); 
   int loctx = get_local_id(0); 
   int loc_size = get_local_size(0);
   int value = 0 ;
   int tmp_array[1000]; 
   for(k=0;k<rB;k++) { 
     tmp_array[k] = A[tx * cA + k] ;
   } 
   for (ty=0 ; ty < cC ; ty++) {
     for (k = loctx ; k < rB ; k+=loc_size) { 
         local_mem[k] = B[ty + k * cC] ;
     }
      barrier(CLK_LOCAL_MEM_FENCE); 
       value = 0 ; 
       for(k=0;k<rB;k+=1) {
           int i = loctx + k*loc_size;
           value += tmp_array[k] * local_mem[i]; 
     } 
   C[ty + (tx * cC)] = value; 
 } 
} 

where I set the global and local work items as follows

const size_t globalWorkItems[1] = {result_row};
const size_t localWorkItems[1] = {(size_t)local_wi_size};

local_wi_size is result_row divided by the number of compute units (with result_row chosen such that result_row % compute_units == 0).
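
The work-size rule above can be sketched as a small host-side helper. This is only an illustration: `pick_local_size` and its parameter names are hypothetical, and it assumes the number of compute units and the maximum work-group size have already been queried from the device (for example via `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS` and `CL_DEVICE_MAX_WORK_GROUP_SIZE`).

```c
#include <stddef.h>

/* Hypothetical helper: one work-group per compute unit.
   Returns the local work size, or 0 if the rule cannot be applied
   (rows not evenly divisible, or the group would exceed the device limit). */
size_t pick_local_size(size_t result_row, size_t compute_units,
                       size_t max_work_group_size)
{
    if (compute_units == 0 || result_row % compute_units != 0)
        return 0;

    size_t local = result_row / compute_units;
    if (local == 0 || local > max_work_group_size)
        return 0;

    return local;
}
```

A return value of 0 would mean the matrix has to be padded or a different local size chosen before enqueueing the kernel.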

Does your code sample get the correct results? Have you tried with smaller matrices? What are the global and local work dimensions? - mfa

1 Answer

Your code is pretty close, but the indexing into the local memory array is actually simpler than you think. You have a row in private memory and a column in local memory, and you need to compute the dot product of these two vectors. You just need to sum row[k]*col[k] for k = 0 up to N-1:

for(k=0;k<rB;k+=1) {
    value += tmp_array[k] * local_mem[k]; 
}
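
If you want to check the kernel's output on the host, the same sum can be written as a plain C reference. This is a minimal sketch, not part of the tutorial code: `mmul_reference` is a name made up for illustration, and it assumes row-major storage with A being rA x rB, B being rB x cC and C being rA x cC, matching the kernel's indexing.

```c
/* Hypothetical host-side reference: C = A * B, all matrices row-major.
   A is rA x rB, B is rB x cC, C is rA x cC. */
void mmul_reference(int *C, const int *A, const int *B,
                    int rA, int rB, int cC)
{
    for (int tx = 0; tx < rA; tx++) {          /* row of A and C */
        for (int ty = 0; ty < cC; ty++) {      /* column of B and C */
            int value = 0;
            for (int k = 0; k < rB; k++) {
                /* row[k] * col[k], with the same index k on both sides */
                value += A[tx * rB + k] * B[ty + k * cC];
            }
            C[ty + tx * cC] = value;
        }
    }
}
```

Comparing the buffer read back from the device against this result on a small matrix should make indexing mistakes easy to spot.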

There's actually a second, more subtle bug that is also present in the example solution given on the slides you are using. Since you are reading and writing local memory inside a loop, you actually need two barriers, in order to make sure that work-items writing to local memory on iteration i don't overwrite values that are being read by other work-items executing iteration i-1.

Therefore, the full code for your kernel (tested and working) should look something like this:

__kernel void mmul(
  __global int* C,
  __global int* A,
  __global int* B,
     const int rA,
     const int rB,
     const int cC,
  __local int* local_mem)   // int, so elements of B are not truncated;
                            // the host must size this argument as rB * sizeof(cl_int)
{
  int k,ty;
  int tx = get_global_id(0);        // row of A (and of C) handled by this work-item
  int loctx = get_local_id(0);
  int loc_size = get_local_size(0);
  int value = 0;
  int tmp_array[1000];              // private copy of row tx of A; assumes rB <= 1000
  for(k=0;k<rB;k++) {
    tmp_array[k] = A[tx * rB + k];  // A has rB columns (cA == rB; cA is not a kernel parameter)
  }
  for (ty=0 ; ty < cC ; ty++) {

    // The work-group cooperatively copies column ty of B into local memory
    for (k = loctx ; k < rB ; k+=loc_size) {
      local_mem[k] = B[ty + k * cC];
    }
    barrier(CLK_LOCAL_MEM_FENCE); // First barrier to ensure writes have finished

    value = 0;
    for(k=0;k<rB;k+=1) {
      value += tmp_array[k] * local_mem[k];
    }
    C[ty + (tx * cC)] = value;

    barrier(CLK_LOCAL_MEM_FENCE); // Second barrier to ensure reads have finished
  }
}
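
To see why the second barrier matters, here is a small sequential C model of one work-group of two work-items. It is only a sketch of one possible interleaving, not real OpenCL: `simulate`, `sim_load` and `sim_dot` are hypothetical names, and the matrices are fixed at 2 x 2 so that each work-item loads exactly one element of the column.

```c
enum { WG = 2 };  /* work-group size; also rB and cC in this toy case */

/* Work-item wi's share of the strided load of column ty of B. */
void sim_load(int *local_mem, const int *B, int wi, int ty)
{
    for (int k = wi; k < WG; k += WG)
        local_mem[k] = B[ty + k * WG];
}

/* Dot product of one row of A with the column currently in local memory. */
int sim_dot(const int *row, const int *local_mem)
{
    int value = 0;
    for (int k = 0; k < WG; k++)
        value += row[k] * local_mem[k];
    return value;
}

/* Computes C = A * B (2 x 2, row-major) in an order two work-items might run. */
void simulate(int *C, const int *A, const int *B, int second_barrier)
{
    int local_mem[WG];

    /* Iteration ty = 0: both work-items load, then meet at the first barrier. */
    sim_load(local_mem, B, 0, 0);
    sim_load(local_mem, B, 1, 0);
    C[0 * WG + 0] = sim_dot(&A[0 * WG], local_mem);      /* work-item 0 */

    if (second_barrier) {
        /* Work-item 0 waits until work-item 1 has finished reading. */
        C[1 * WG + 0] = sim_dot(&A[1 * WG], local_mem);  /* work-item 1 */
        sim_load(local_mem, B, 0, 1);
    } else {
        /* Work-item 0 races ahead and overwrites local_mem[0] too early. */
        sim_load(local_mem, B, 0, 1);
        C[1 * WG + 0] = sim_dot(&A[1 * WG], local_mem);  /* reads a mixed column */
    }
    sim_load(local_mem, B, 1, 1);

    /* Iteration ty = 1: first barrier again, then both work-items compute. */
    C[0 * WG + 1] = sim_dot(&A[0 * WG], local_mem);
    C[1 * WG + 1] = sim_dot(&A[1 * WG], local_mem);
}
```

With A = {1, 2, 3, 4} and B = {5, 6, 7, 8}, calling `simulate` with `second_barrier` set to 1 gives C = {19, 22, 43, 50}, whereas with 0 the third element comes out as 46 because work-item 1 reads one element from the old column and one from the new. On a real device the outcome without the barrier depends on scheduling, so the bug may only show up intermittently.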

You can find the full set of exercises and solutions that go with the slides you are looking at on the HandsOnOpenCL GitHub page. There's also a more complete set of slides from the same tutorial available here, which go on to show a much more optimised matrix multiply example that uses a blocking approach to better exploit temporal and spatial locality. The aforementioned missing barrier bug has been fixed in the example solution code, but not on the slides (yet).
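
The blocking idea mentioned above can be illustrated on the CPU with a few lines of C. This is a rough sketch of loop tiling in general, not the tutorial's OpenCL kernel: `mmul_blocked` and the `tile` parameter are invented for illustration, and the matrices are assumed to be row-major with the same shapes as in the kernel (A is rA x rB, B is rB x cC, C is rA x cC).

```c
/* Smaller of two ints, used to clamp tiles at the matrix edges. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical tiled multiply: C = A * B, processed in tile x tile blocks
   so that each block of A and B is reused while it is still in cache. */
void mmul_blocked(int *C, const int *A, const int *B,
                  int rA, int rB, int cC, int tile)
{
    if (tile <= 0)
        tile = 1;                      /* fall back to a 1 x 1 tile */

    for (int i = 0; i < rA * cC; i++)
        C[i] = 0;

    for (int i0 = 0; i0 < rA; i0 += tile) {
        for (int k0 = 0; k0 < rB; k0 += tile) {
            for (int j0 = 0; j0 < cC; j0 += tile) {
                int i1 = min_int(i0 + tile, rA);
                int k1 = min_int(k0 + tile, rB);
                int j1 = min_int(j0 + tile, cC);

                /* Accumulate the contribution of one block of A and B. */
                for (int i = i0; i < i1; i++) {
                    for (int k = k0; k < k1; k++) {
                        int a = A[i * rB + k];
                        for (int j = j0; j < j1; j++)
                            C[i * cC + j] += a * B[k * cC + j];
                    }
                }
            }
        }
    }
}
```

In an OpenCL version the blocks would typically be staged in local memory by a 2D work-group, but the loop structure above is the part that carries over.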