I'm learning OpenCL in order to implement a relatively complex image processing algorithm which includes several subroutines that should be implemented as kernels.
The implementation is intended to run on a Mali-T6xx GPU.
I read the "OpenCL Programming by Example" book and the "Optimizing OpenCL kernels on the Mali-T600 GPUs" document.
In the book's examples, the global work size is fixed at some number of work items, and each work item processes several pixels in a for loop.
In the document, the kernels are written without loops: each work item executes the kernel body exactly once.
As I understand it, the maximum number of work items that can be spawned at once on the Mali-T600 GPUs is 256 (and that's only for simple kernels), while most images clearly have more pixels than that. So my understanding is that, for a kernel without loops, the global work size can simply be the number of pixels in the image, and new work-item threads are spawned as soon as possible until every work item in the global range has executed the kernel. Is that right? Is it effectively a thread-spawning loop in itself?
In the book, on the other hand, the global work size is smaller than the number of pixels to process, but the kernel contains loops so that each work item processes several pixels during a single execution of the kernel code.
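To make the comparison concrete, here is roughly how I picture the two styles. This is only a sketch in plain C, not real OpenCL and not code from the book or the document: the `gid` and `gsize` parameters stand in for `get_global_id(0)` and `get_global_size(0)`, the `run_*` functions are hypothetical stand-ins for enqueueing the kernel (they simply run the work items one after another), and `invert` is a made-up per-pixel operation.

```c
#include <stddef.h>

/* Hypothetical per-pixel operation: invert an 8-bit grayscale value. */
static unsigned char invert(unsigned char p)
{
    return (unsigned char)(255 - p);
}

/* Style 1 (as in the document): no loop, one pixel per work item.
   gid stands in for get_global_id(0). */
static void kernel_one_per_item(size_t gid,
                                const unsigned char *src, unsigned char *dst)
{
    dst[gid] = invert(src[gid]);
}

/* Style 2 (as in the book): fewer work items than pixels, each work item
   strides over several pixels. gsize stands in for get_global_size(0). */
static void kernel_looped(size_t gid, size_t gsize, size_t n,
                          const unsigned char *src, unsigned char *dst)
{
    for (size_t i = gid; i < n; i += gsize)
        dst[i] = invert(src[i]);
}

/* Stand-in for launching style 1: global size equals the pixel count. */
static void run_one_per_item(size_t n,
                             const unsigned char *src, unsigned char *dst)
{
    for (size_t gid = 0; gid < n; ++gid)
        kernel_one_per_item(gid, src, dst);
}

/* Stand-in for launching style 2: global size is smaller than the pixel count. */
static void run_looped(size_t gsize, size_t n,
                       const unsigned char *src, unsigned char *dst)
{
    for (size_t gid = 0; gid < gsize; ++gid)
        kernel_looped(gid, gsize, n, src, dst);
}
```

Both versions touch every pixel exactly once; the only difference is whether the iteration over pixels lives in the launch (the global work size) or inside the kernel body.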
So I want to know which of these is the proper way to write image processing kernels, or any OpenCL kernels for that matter, and in what situations one approach might be better than the other, assuming I have understood both correctly.