1
votes

In Opencl, buffers are the conduit through which data is communicated from the host application.

cl_mem clCreateBuffer (cl_context context, cl_mem_flags flags, size_t size,
                       void *host_ptr, cl_int *errcode_ret);

Now if I have a buffer a_buffer flaged as READ_ONLY, and the kernel is:

__kernel void two_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    float b = a[i] * 2;
}

My question is that: is a_buffer a global memory or constant memory? Should I use __constant qualifier for a. What is the connection between cl_mem_flags(READ_ONLY and READ_WRITE) and memory qualifier(global and constant)?

1
added some more info about "host side" part.huseyin tugrul buyukisik

1 Answers

2
votes
__constant

qualifier is used for constant memory and some cards get it in texture cache and get independent bandwidth from __global but is very limited in size.

__global __read_only * float

means, opencl implementation will try put it in cache(or use some other data path) if hardware sees fit but it is __global so is limited by only vram size or its fraction instead of just 64kB(for example) for __constant.

These qualifiers are for device-side optimization.

At host-side optimization, you should supply it with a

CL_MEM_READ_ONLY 

as flag for buffer creation. This means device will only read from it(probably using some DMA/pcie access/caching optimizations) but can be written from host side(as being a host like C# C++ code, not device) using enqueuewrite or map unmap operations.

__constant

is for parametric constant definitions, not for data to be processing.

If you are writing a filter code, data could be __global and filter mask could be __constant if that cannot fit in __private memory(which has ultimate bandwidth) or __local memory(slower than private) so accessing mask bytes does not decrease data bandwidth.

Now answering your question:

" is a_buffer a global memory or constant memory? "

it is global for device side(kernel side) because you declared it as __global but it could be anywhere on host side(hardware).

Edit: for host-side, depends which other flags are used, for example, USE_HOST_PTR makes it directly-accessible from system RAM and there is only a virtual buffer on device side, without it and with just a CL_MEM_READ_WRITE device memory will have a real buffer and its mapped shadow in RAM (as a sub-step for clenqueueread or clenqueuewrite) and copying will visit this shadow first then uploaded to gpu.

An example device: Intel(R) HD (TM) GRAPHICS 400 in a 4GB DDR3L laptop:

Query                                           value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE                 65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE                   262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE                     1636414260 bytes

CL_DEVICE_GLOBAL_MEM_CACHE_TYPE               CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE                      65536(vs constant, benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE                      CL_LOCAL(so is faster than global)  

you cannot query private memory size but for a mid-segment gaming amd card, it is 256kB per thread group. If you set 64 threads per group, it can use 4kB register space per thread or half of it(because of compiler optimizations) before getting slow because of spilling to global memory.