OpenCL Small constant memory size on Nvidia GPU

Question

I have the following code for matrix multiplication, abbreviated for simplicity. I plan to use local memory that is block_size*block_size to hold a block of sub-matrix. I keep getting error code -52 in clEnqueueNDRangeKernel when I run it on NVIDIA GPU. And after some research, I found out the constant memory size on NVIDIA gpu is extremely small.

host:

    cl::Buffer a_buf{ context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, a.bytes(), a.data };
    cl::Buffer b_buf{ context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, a.bytes(), bT.data };
    cl::Buffer result_buf{ context, CL_MEM_READ_WRITE , result.bytes(), nullptr }; //for memory mapping
    kernel.setArg(0, a_buf);
    kernel.setArg(1, b_buf);
    kernel.setArg(2, local_size*local_size* sizeof(float), nullptr);
    kernel.setArg(3, local_size*local_size* sizeof(float), nullptr);
    kernel.setArg(4, result_buf);
    queue.enqueueNDRangeKernel(kernel, { 0,0 }, { a.rows, a.rows }, {local_size, local_size});
                                        //  ^ offset   ^global work size  ^local work size

Kernel:

__kernel void matrixMul(__constant float* a,
    __constant float* b,    //storing the original matrix data
    __local float* a_local, 
    __local float* b_local, //storing a sub-matrix block for the work-group
    __global float* result)
 {...}

Using CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, my RX580 returns almost all available VRAM, but my GTX1650 returns only 64KB. I indeed get a significant performance boost from my RX580 when using __constant instead of __global. Is there anything I did wrong, or it happens to be the case so that I need to prepare different kernels to run on AMD and NVIDIA gpus?

EDIT: I found a relevant issue on github here So I changed __constant float* a -> __global const float* restrict a, it works.

ProjectPhysX ProjectPhysX · Accepted Answer · 2020-07-27T07:22:31

The constant memory size on NVIDIA GPUs is indeed very small at 64KB (I checked for Titan Xp, GTX 960M, RTX 2080 Ti, Tesla K20c, Tesla K40m). On the AMD Radeon VII, constant memory size is much larger at 14GB. On Intel CPUs (i7-8700K, Xeon E5-2680 v2) constant memory size is 128KB. This is a driver limitation and the workaround is to use global const float* restrict (and restrict for all other kernel arguments of type float*) instead of constant float*, as you already figured out.

If performance on AMD GPUs is vastly different, you may use different kernel declarations for AMD and NVIDIA GPUs. You can switch between them either via #ifdef or at runtime via string concatenation before the OpenCL code is compiled; this way you don't have to have the entire kernel twice in your code but only the line with the kernel parameter declarations.

OpenCL Small constant memory size on Nvidia GPU

1 Answers