This code implements matrix multiplication in OpenCL. The three matrices (two inputs, one output) are each 1024x1024.
In the OpenCL implementation the execution range is two-dimensional: the global size is 1024x1024 work-items, and these are divided into work-groups of 16x16 work-items each.
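To make the index arithmetic concrete, here is a minimal sketch in plain C (not OpenCL host code) of how, as far as I understand the execution model, a global size and a local size relate in one dimension. The helper names `num_groups` and `global_id` are hypothetical, and the sketch assumes a zero global offset and a global size that is an exact multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups along one dimension.
 * Assumes the global size is an exact multiple of the local size. */
static size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: mirrors the relation
 *   global_id = group_id * local_size + local_id
 * for a zero global offset. This is the value get_global_id()
 * would report for that work-item in that dimension. */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With a global size of 1024 and a local size of 16 per dimension, `num_groups(1024, 16)` gives 64, so there are 64x64 = 4096 work-groups of 256 work-items each, which is 1,048,576 work-items in total, one per output cell. The kernel can index the matrices with get_global_id() alone because the group and local IDs are already folded into it.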
The question is: why set the work-group size at all, given that the kernel uses neither local memory nor get_local_id()? Wouldn't it be better to pass NULL as the local work size, so that each work-item simply fills one cell of the output matrix?
Reading the kernel code (at the bottom of the page I linked), it seems to me that each work-group is set up to run 16x16 work-items, but in the end that grouping goes unused. I would set the local size to NULL. Why do they use 16x16, and what does it improve? I'm very confused.