Random NaN and incorrect results with OpenCL kernel

Question

I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.

The Kernel

I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:

__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                   __global const float* A, __global const float* B, const float beta, 
                   __global float* C, __local float* Asub, __local float* Bsub) {

  const uint row = get_local_id(0);
  const uint col = get_local_id(1);
  const uint TS = get_local_size(0); // Tile size
  const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
  const uint globalCol = TS * get_group_id(1) + col; // Row ID of C (0..N)

  // Initialise the accumulation register
  float acc = 0.0f;

  // Loop over all tiles
  const int numtiles = K / TS;
  for (int t = 0; t < numtiles; t++) {
    const int tiledRow = TS * t + row;
    const int tiledCol = TS * t + col;
    Asub[col * TS + row] = A[tiledCol * M + globalRow];
    Bsub[col * TS + row] = B[globalCol * K + tiledRow];

    barrier(CLK_LOCAL_MEM_FENCE);

    for(int k = 0; k < TS; k++) {
      acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
  }

  C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}

Tile Size (TS) is now a value defined in the calling code, which looks like this:

  // A, B and C are 2D matrices, their cl::Buffers have already been set up
  // and values appropriately set.

  kernel.setArg(0, (cl_int)nrowA);
  kernel.setArg(1, (cl_int)ncolB);
  kernel.setArg(2, (cl_int)ncolA);
  kernel.setArg(3, alpha);
  kernel.setArg(4, A_buffer);
  kernel.setArg(5, B_buffer);
  kernel.setArg(6, beta);
  kernel.setArg(7, C_buffer);
  kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
  kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));

  cl::NDRange global(nrowA, ncolB);
  cl::NDRange local(nrowA, ncolB);

  status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);

The Problem

The problem I am encountering is, unit tests (written with Google's gtest) I have written will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time)

I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.

However, I can get this correct result maybe just 75% of the time.

Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.

I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.

What am I doing wrong here?

Things I have tried

I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.

Other points to note

I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.

Baiz Baiz · Accepted Answer · 2016-07-26T20:33:49

This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).

I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.

Such a behavior is a typical indicator of race conditions.

I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)

Actually this is a good thing. Finally we have some consistency.

Initializing local memory

Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:

local float* ptr = ...          // your pointer to local memory
int idx = get_local_id(0);      // get the index for the current work-item
ptr[idx] = 0.f;                 // init with value 0
barrier(CLK_LOCAL_MEM_FENCE);   // synchronize local memory access within workgroup

If your local memory is larger, e.g. 64 floats, you will have to use a loop where each work item initializes 4 values, at least that is the most efficient way. However, no one will stop you from using every work item to initialize every value in the local memory, even though that is complete nonsense since you're essentially initializing it multiple times.

Your changes

The original algorithm looks like it is especially designed to use quadratic tiles.

__local float Asub[TS][TS];
__local float Bsub[TS][TS];

Not only that but the size of local memory matches the workgroup size, in their example 32x32. When I look at your kernel parameters for local memory, I can see that you use parameters that are defined as M and N in the original algorithm. This doesn't seem correct.

Update 1

Since you have not described if the original algorithm works for you, this is what you should do to find your error:

Create a set of testdata. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. minimum size, mulitples of x, etc.). Also, use large data sets since some errors only show if multiple workgroups are dispatched.
Use the original, unaltered algorithm with your testdata sets and verify the results.
Change the algorithm only that instead of fixed size local memory, dynamic local memory size is used, but make sure it has the same size as the fixed size approach. This is what you tried but I think it failed due to what I have described under "Your changes".