I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:
__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
__global const float* A, __global const float* B, const float beta,
__global float* C, __local float* Asub, __local float* Bsub) {
const uint row = get_local_id(0);
const uint col = get_local_id(1);
const uint TS = get_local_size(0); // Tile size
const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
const uint globalCol = TS * get_group_id(1) + col; // Row ID of C (0..N)
// Initialise the accumulation register
float acc = 0.0f;
// Loop over all tiles
const int numtiles = K / TS;
for (int t = 0; t < numtiles; t++) {
const int tiledRow = TS * t + row;
const int tiledCol = TS * t + col;
Asub[col * TS + row] = A[tiledCol * M + globalRow];
Bsub[col * TS + row] = B[globalCol * K + tiledRow];
barrier(CLK_LOCAL_MEM_FENCE);
for(int k = 0; k < TS; k++) {
acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is, unit tests (written with Google's gtest) I have written will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
- I have verified that the data going into the kernels are correct.
- I have tried to initialize both
__localmemory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly) - I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
- I am using the C++ wrapper retrieved directly from the Github page.
- To use the wrapper, I have defined
CL_HPP_MINIMUM_OPENCL_VERSION 120andCL_HPP_TARGET_OPENCL_VERSION 120. - I am compiling the kernels with the
-cl-std=CL1.2flag. - All
cl::Buffers are created with only theCL_MEM_READ_WRITEflag. - I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
- I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
- I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
- All 3 platforms exhibit the same problem with varying frequency.