I have the following code for matrix multiplication, abbreviated for simplicity. I plan to use local memory of block_size*block_size to hold a sub-matrix block. I keep getting error code -52 (CL_INVALID_KERNEL_ARGS) from clEnqueueNDRangeKernel when I run it on an NVIDIA GPU. After some research, I found out that the constant memory size on NVIDIA GPUs is extremely small.
host:
cl::Buffer a_buf{ context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, a.bytes(), a.data };
cl::Buffer b_buf{ context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bT.bytes(), bT.data };
cl::Buffer result_buf{ context, CL_MEM_READ_WRITE , result.bytes(), nullptr }; //for memory mapping
kernel.setArg(0, a_buf);
kernel.setArg(1, b_buf);
kernel.setArg(2, local_size*local_size* sizeof(float), nullptr);
kernel.setArg(3, local_size*local_size* sizeof(float), nullptr);
kernel.setArg(4, result_buf);
queue.enqueueNDRangeKernel(kernel, { 0,0 }, { a.rows, a.rows }, {local_size, local_size});
// ^ offset ^global work size ^local work size
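As a reference for the scheme above (this sketch is mine, not from the original code; `N` and `BLOCK` stand in for the matrix dimension and `block_size`/`local_size`, and the matrices are assumed square with `BLOCK` dividing `N`, matching the enqueue call), the blocked multiplication the kernel is meant to perform can be mirrored on the CPU:

```c
#include <assert.h>
#include <string.h>

#define N 4      /* matrix dimension (square, as implied by the {a.rows, a.rows} global size) */
#define BLOCK 2  /* tile width; must divide N, like local_size dividing the global size */

/* Blocked multiplication: result = a * b, processed in BLOCK x BLOCK tiles.
 * Each (bi, bj, bk) tile step is the CPU analogue of one work-group staging
 * a tile of a and a tile of b in __local memory before accumulating. */
static void matmul_blocked(const float a[N][N], const float b[N][N], float result[N][N])
{
    memset(result, 0, sizeof(float) * N * N);
    for (int bi = 0; bi < N; bi += BLOCK)
        for (int bj = 0; bj < N; bj += BLOCK)
            for (int bk = 0; bk < N; bk += BLOCK)  /* one tile step along k */
                for (int i = bi; i < bi + BLOCK; ++i)
                    for (int j = bj; j < bj + BLOCK; ++j)
                        for (int k = bk; k < bk + BLOCK; ++k)
                            result[i][j] += a[i][k] * b[k][j];
}
```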
Kernel:
__kernel void matrixMul(__constant float* a,
                        __constant float* b,       // storing the original matrix data
                        __local float* a_local,
                        __local float* b_local,    // storing a sub-matrix block for the work-group
                        __global float* result)
{...}
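For context, the elided body would typically be a tile loop like the following. This is only a sketch under my own assumptions (square matrices, dimension divisible by the work-group size, and `b` holding the transposed matrix `bT` as in the host code); the variable names are illustrative:

```c
int n     = get_global_size(0);              // matrix dimension (square assumed)
int block = get_local_size(0);               // tile width == local_size
int row   = get_global_id(1), col  = get_global_id(0);
int lrow  = get_local_id(1),  lcol = get_local_id(0);
float acc = 0.0f;
for (int t = 0; t < n; t += block) {
    // each work-item stages one element of the a tile and one of the (transposed) b tile
    a_local[lrow * block + lcol] = a[row * n + t + lcol];
    b_local[lrow * block + lcol] = b[(col - lcol + lrow) * n + t + lcol]; // b holds bT
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int k = 0; k < block; ++k)
        acc += a_local[lrow * block + k] * b_local[lcol * block + k];
    barrier(CLK_LOCAL_MEM_FENCE);
}
result[row * n + col] = acc;
```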
Querying CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, my RX 580 reports almost all available VRAM, but my GTX 1650 reports only 64 KB. I do get a significant performance boost on my RX 580 when using __constant instead of __global. Is there anything I did wrong, or is this simply how it is, so that I need to prepare different kernels for AMD and NVIDIA GPUs?
EDIT: I found a relevant issue on GitHub here.
So I changed __constant float* a to __global const float* restrict a, and it works.
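For clarity, the resulting signature looks like this (the post only mentions changing `a`, but the same 64 KB limit applies to `b`, so I assume it gets the same treatment; the rest of the kernel is unchanged):

```c
__kernel void matrixMul(__global const float* restrict a,
                        __global const float* restrict b,
                        __local float* a_local,
                        __local float* b_local,
                        __global float* result)
```

The `const ... restrict` qualifiers let the compiler cache reads through the read-only data path without the hard size cap that `__constant` carries on NVIDIA hardware.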