I'm learning OpenCL. My current task is very simple: copy one large array to another, say a[301][300][300] to b[301][300][300]. It's just a test to help me understand what the global work size and the local work size are. I use SVM to pass float8 vector arrays to the kernel, whose parameters are:
__global float8* dts,
__global float8* dts_from_file
1. It seems the global work size has to be larger than the array size. In my test case I use
size_t globalWorkSize[3] = { 128, 128, 256 };
so that (128*128*256*8) = 33,554,432 > 301*300*300 = 27,090,000. Otherwise, I get truncated output. Am I right, or am I confused about the definition of the global work size? FYI, my device reports:
CL_DEVICE_ADDRESS_BITS=64
CL_DEVICE_MAX_WORK_GROUP_SIZE=256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS=3
CL_DEVICE_MAX_WORK_ITEM_SIZES[0,1,2]=256, 256, 256
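To make the arithmetic in question 1 concrete, here is a minimal host-side sketch in plain C (no OpenCL calls). It assumes each work-item copies exactly one float8 element, which is an assumption about the kernel body, not something shown above. The helper names float8_count, total_work_items and covers_array are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Number of float8 elements needed to hold nx*ny*nz floats, rounded up. */
size_t float8_count(size_t nx, size_t ny, size_t nz) {
    size_t floats = nx * ny * nz;
    return (floats + 7) / 8;
}

/* Total number of work-items launched by a 3D global work size. */
size_t total_work_items(const size_t gws[3]) {
    return gws[0] * gws[1] * gws[2];
}

/* Returns 1 when there is at least one work-item per float8 element.
   Assumption: one work-item copies one float8. */
int covers_array(const size_t gws[3], size_t nx, size_t ny, size_t nz) {
    return total_work_items(gws) >= float8_count(nx, ny, nz);
}
```

Under that assumption, { 128, 128, 256 } launches 4,194,304 work-items for 3,386,250 float8 elements, so the array is covered with work-items to spare. The surplus work-items would index past the end of the buffers unless the kernel checks its index against the element count.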
2. Is the local work size limited by CL_KERNEL_WORK_GROUP_SIZE=256? I use
size_t localWorkSize[3] = { 4,8,8 };
As soon as I change 4 to a larger value, clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE. Is that because 4*8*8=256 is already at the limit?
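The check behind question 2 can be sketched the same way. This is only an illustration of the two limits quoted in this post (256 work-items per work-group, and 256 per dimension), not the runtime's actual validation code. The helper names work_group_items and local_size_fits are hypothetical.

```c
#include <stddef.h>

/* Work-items in one work-group for a 3D local work size. */
size_t work_group_items(const size_t lws[3]) {
    return lws[0] * lws[1] * lws[2];
}

/* Returns 1 when the local size respects both limits:
   - each dimension is nonzero and <= max_item_sizes[d]
     (CL_DEVICE_MAX_WORK_ITEM_SIZES)
   - the product of all dimensions is <= max_group_size
     (CL_KERNEL_WORK_GROUP_SIZE or CL_DEVICE_MAX_WORK_GROUP_SIZE) */
int local_size_fits(const size_t lws[3], size_t max_group_size,
                    const size_t max_item_sizes[3]) {
    for (int d = 0; d < 3; ++d) {
        if (lws[d] == 0 || lws[d] > max_item_sizes[d]) {
            return 0;
        }
    }
    return work_group_items(lws) <= max_group_size;
}
```

With the values above, { 4, 8, 8 } gives exactly 256 work-items and passes, while { 8, 8, 8 } gives 512 and fails the product check even though every single dimension is well under 256.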
3. What about the global/local work sizes for multiple devices (CPU+GPU)? Do I need to specify a different work size for each device?
Thanks in advance.