OpenCL large global size or for loops per work item?

Question

I'm learning OpenCL in order to implement a relatively complex image processing algorithm which includes several subroutines that should be implemented as kernels.

The implementation is intended to be on Mali T-6xx GPU.

I read the "OpenCL Programming by Example" book and the "Optimizing OpenCL kernels on the Mali-T600 GPUs" document.

In the book's examples, a fixed global work size is used and each work item processes several pixels in a for loop.

In the document, the kernels are written without loops, so each work item runs the kernel body exactly once.

Since the maximum global size of work items that can be spawned on the Mali T-600 GPUs is 256 (and that's for simple kernels), and there are clearly more pixels than that to process in most images, my understanding is that the kernel without loops will spawn more work item threads as soon as possible until the whole global size of work items has finished executing the kernel, and the global size might just be the number of pixels in the image. Is that right? Such that it is a kind of a thread spawning loop in itself?

On the other hand, in the book the global work size is smaller than the number of pixels to process, but the kernel has loops that make each work item process several pixels while executing the kernel code.

So I want to know which way is the proper way to write image processing kernels or any OpenCL kernels for that matter and in what situations one way might be better than the other, assuming I understood correctly both ways...

solidpixel · Accepted Answer · 2016-05-08T20:03:43

Is that right? Such that it is a kind of a thread spawning loop in itself?

Yes.

So I want to know which way is the proper way to write image processing kernels or any OpenCL kernels for that matter and in what situations one way might be better than the other

I suspect there isn't a "right" answer in general - there are multiple hardware vendors and multiple drivers, so the "best" approach will vary from vendor to vendor.

For Mali in particular, thread spawning is handled entirely by hardware, so it will in general be faster than explicit loops in the shader code, which cost extra instructions to execute.
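To make the two styles concrete, here is a rough sketch in plain C rather than OpenCL C. It only models the index mapping: the `gid` parameter stands in for `get_global_id(0)`, the `run_*` functions stand in for what the hardware dispatcher (or `clEnqueueNDRangeKernel`) would do, and the pixel inversion and all function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Style A: one pixel per work item. In a real kernel, gid would be
   get_global_id(0) and the global size would equal the pixel count. */
void invert_single(const uint8_t *src, uint8_t *dst, size_t gid)
{
    dst[gid] = (uint8_t)(255 - src[gid]);
}

/* Style B: fewer work items than pixels. Each work item strides through
   the image, so the loop counter and bounds check cost extra instructions. */
void invert_strided(const uint8_t *src, uint8_t *dst,
                    size_t gid, size_t global_size, size_t n_pixels)
{
    for (size_t i = gid; i < n_pixels; i += global_size)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Host-side stand-in for the dispatch of style A: on the GPU this loop
   over gid is performed by the thread-spawning hardware, not by code. */
void run_single(const uint8_t *src, uint8_t *dst, size_t n_pixels)
{
    for (size_t gid = 0; gid < n_pixels; ++gid)
        invert_single(src, dst, gid);
}

/* Host-side stand-in for the dispatch of style B with a smaller global size. */
void run_strided(const uint8_t *src, uint8_t *dst,
                 size_t n_pixels, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        invert_strided(src, dst, gid, global_size, n_pixels);
}
```

Both produce the same output. The difference is only where the iteration lives: in style A it is implicit in the global size, in style B part of it is moved into the kernel body.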

There is normally some advantage to at least some vectorization - e.g. processing vec4 or vec8 vectors of pixels per work item rather than just 1 - as the Mali-T600/700/800 GPU cores use a vector arithmetic architecture.
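As a rough illustration of the vec4 idea, here is a plain C model (again not OpenCL C) in which each work item handles four adjacent pixels. In a real kernel the inner lane loop would typically become a single vector operation, for example on a `uchar4` read with `vload4` and written with `vstore4`. The function names, the inversion operation and the assumption that the pixel count is a multiple of 4 are all just for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define PIXELS_PER_ITEM 4

/* One work item processes four adjacent pixels. gid stands in for
   get_global_id(0); the lane loop models one 4-wide vector operation. */
void invert_vec4_item(const uint8_t *src, uint8_t *dst, size_t gid)
{
    size_t base = gid * PIXELS_PER_ITEM;
    for (size_t lane = 0; lane < PIXELS_PER_ITEM; ++lane)
        dst[base + lane] = (uint8_t)(255 - src[base + lane]);
}

/* Host-side stand-in for the dispatch: the global size shrinks to
   n_pixels / 4. Assumes n_pixels is a multiple of PIXELS_PER_ITEM. */
void run_vec4(const uint8_t *src, uint8_t *dst, size_t n_pixels)
{
    size_t global_size = n_pixels / PIXELS_PER_ITEM;
    for (size_t gid = 0; gid < global_size; ++gid)
        invert_vec4_item(src, dst, gid);
}
```

The global size is still derived from the image size, so the hardware still does the thread spawning. Each thread just does one vector's worth of work instead of one pixel's worth.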
