__constant
qualifier is used for constant memory and some cards get it in texture cache and get independent bandwidth from __global but is very limited in size.
__global __read_only * float
means, opencl implementation will try put it in cache(or use some other data path) if hardware sees fit but it is __global so is limited by only vram size or its fraction instead of just 64kB(for example) for __constant.
These qualifiers are for device-side optimization.
At host-side optimization, you should supply it with a
CL_MEM_READ_ONLY
as flag for buffer creation. This means device will only read from it(probably using some DMA/pcie access/caching optimizations) but can be written from host side(as being a host like C# C++ code, not device) using enqueuewrite or map unmap operations.
__constant
is for parametric constant definitions, not for data to be processing.
If you are writing a filter code, data could be __global and filter mask could be __constant if that cannot fit in __private memory(which has ultimate bandwidth) or __local memory(slower than private) so accessing mask bytes does not decrease data bandwidth.
Now answering your question:
" is a_buffer a global memory or constant memory? "
it is global for device side(kernel side) because you declared it as __global but it could be anywhere on host side(hardware).
Edit: for host-side, depends which other flags are used, for example, USE_HOST_PTR makes it directly-accessible from system RAM and there is only a virtual buffer on device side, without it and with just a CL_MEM_READ_WRITE device memory will have a real buffer and its mapped shadow in RAM (as a sub-step for clenqueueread or clenqueuewrite) and copying will visit this shadow first then uploaded to gpu.
An example device: Intel(R) HD (TM) GRAPHICS 400 in a 4GB DDR3L laptop:
Query value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE 65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE 262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE 1636414260 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE 65536(vs constant, benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE CL_LOCAL(so is faster than global)
you cannot query private memory size but for a mid-segment gaming amd card, it is 256kB per thread group. If you set 64 threads per group, it can use 4kB register space per thread or half of it(because of compiler optimizations) before getting slow because of spilling to global memory.