OpenCL work-items per work-groups in a 2DRange

Question

This code implements matrix multiplication in OpenCL. The three matrices (two inputs, one output) are each 1024x1024.

In the OpenCL implementation the execution range is two-dimensional, so we have 1024x1024 work-groups, each composed of 16x16 work-items.

The question is: why should we set the size of each work-group when the kernel uses neither local memory nor get_local_id() calls? Wouldn't it be better to set the work-group dimension to NULL, so that each work-group fills one cell of the output matrix?

To me, reading the kernel code (at the bottom of the page I linked), it seems that each work-group is set up to work with 16x16 work-items, but in the end they remain unused. I would set the local size to NULL. Why do they use 16x16, and what does it improve? I'm very confused.

Prashant Ravi · Accepted Answer · 2016-01-18T02:29:38

The local work size is passed to clEnqueueNDRangeKernel to tell the OpenCL device how to group work-items from the global range into work-groups. If you pass NULL, the OpenCL implementation selects a local work size itself and executes your kernel. In OpenCL 1.x the global work size must be evenly divisible by the local work size, and the local work size should ideally be a multiple of the kernel's preferred work-group size multiple on your device (which can be queried through clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE). So even if you pass NULL as the local work size, I guess OpenCL will still choose an appropriate value and launch the computations in parallel. I have seen cases where passing NULL makes my code run faster than giving an explicit value.
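To make the grouping concrete, here is a minimal sketch in plain C (no OpenCL runtime needed) of the arithmetic for the sizes in the question. The helper names num_groups and global_id are made up for illustration and are not part of the OpenCL API; the sketch also assumes a zero global offset.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-groups along one dimension.
   Returns 0 when the local size does not evenly divide the global size,
   the case that OpenCL 1.x rejects with CL_INVALID_WORK_GROUP_SIZE. */
size_t num_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Hypothetical helper: how get_global_id(d) relates to get_group_id(d),
   get_local_size(d) and get_local_id(d) when the global offset is zero. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}
```

With a global size of 1024 and a local size of 16 in each dimension, num_groups(1024, 16) gives 64, so the range is 64x64 work-groups of 16x16 work-items, which is 1024x1024 work-items in total. A kernel that only reads get_global_id() sees the same indices 0..1023 whatever the local size is, so choosing 16x16 or passing NULL changes how the device schedules the work-items, not which output cell each one computes.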

P.S.: I'm still a rookie in OpenCL, so forgive any minor mistakes.

