How should I view global and local work sizes

Question

I've been using OpenCL for a little while now for hobby purposes. I was wondering if someone could explain how I should view global and local work sizes. I've been playing around with them for a bit, but I cannot seem to wrap my head around them.

I have this piece of code. The kernel is launched with a global work size of 8 and a local work size of 4:

    __kernel void foo(__global int *bar)
    {    
        bar[get_global_id(0)] = get_local_id(0);
    }

The result in bar looks like this:

{0, 1, 2, 3, 0, 1, 2, 3, 4}

I know this happens because of the work sizes I've used, but I can't seem to wrap my head around how I should view it.

Does this mean that there are 4 threads working locally and 8 globally, so I have 4 * 8 threads running in total? And if so, what makes those 4 working locally special?

Or does this mean the main body of the kernel just has two counters, one local and one global? If so, what is the point of that?

I know I might be a bit vague and my question might seem dumb, but I don't know how to use this more optimally or how I should view it.

pmdj · Accepted Answer · 2020-06-12T13:58:26

Global size is the total number of work items.
Work groups subdivide this total workload, and local size defines the size of each group within the global size.

So for a global work size of 8 and a local size of 4, each in 1 dimension, you will have 2 groups. get_global_id(0) will be different for each thread (work item): 0…7. get_local_id(0) will return 0…3 for the 4 threads within each group. This is what you're seeing in indices 0 through 7 of your output.
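One way to picture how the IDs relate is to emulate the launch on the CPU. The sketch below is plain C, not OpenCL host code. The names `work_item_ids`, `ids_for`, and `emulate_foo` are made up for illustration, and it assumes a 1-D launch with no global offset and a local size that divides the global size evenly.

```c
#include <assert.h>
#include <stddef.h>

/* The three IDs a work item sees in a 1-D launch. */
typedef struct {
    size_t global_id; /* what get_global_id(0) would return */
    size_t group_id;  /* what get_group_id(0) would return */
    size_t local_id;  /* what get_local_id(0) would return */
} work_item_ids;

/* Hypothetical helper: derive the IDs of one work item,
 * assuming a zero global offset. */
work_item_ids ids_for(size_t global_id, size_t local_size)
{
    work_item_ids ids;
    assert(local_size > 0);
    ids.global_id = global_id;
    ids.group_id  = global_id / local_size; /* which group the item falls in */
    ids.local_id  = global_id % local_size; /* position inside that group */
    return ids;
}

/* Emulates the kernel `foo` sequentially: one loop iteration per work item.
 * A real device runs these iterations in parallel and in no fixed order. */
void emulate_foo(int *bar, size_t global_size, size_t local_size)
{
    size_t gid;
    assert(global_size % local_size == 0); /* assumption stated above */
    for (gid = 0; gid < global_size; ++gid) {
        work_item_ids ids = ids_for(gid, local_size);
        bar[ids.global_id] = (int)ids.local_id;
    }
}
```

With a global size of 8 and a local size of 4, `emulate_foo` writes 0, 1, 2, 3, 0, 1, 2, 3 into `bar[0]` through `bar[7]` and touches nothing else. The loop runs 8 times, not 4 * 8 times.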

This also means that if your global work size is 8, only the first 8 items of bar will be set by your kernel. So anything beyond that (the value 4 at index 8 in your output) is undefined.

> Does this mean that there are 4 threads working locally and 8 globally so i have 4 * 8 threads running in total? and if so what makes those 4 working locally special?

You're overthinking it. There are 8 threads in total. They are subdivided into 2 groups of 4 threads. What is "local" about the threads in those groups is that they share access to the same local memory. Threads which are not in the same group can only "communicate" via global memory.

Using local memory can hugely improve efficiency for some workloads:

- It's very fast.
- Threads in a work group can use barriers to ensure they are in lock-step, i.e. they can wait for one another to guarantee another thread has written to a specific local memory location. (Threads in different groups cannot wait for each other.)
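To make the role of local memory and barriers more concrete, here is a rough CPU emulation in plain C of a common pattern: each work group sums its own slice of the input through a shared scratch array. It is only a sketch under stated assumptions. `LOCAL_SIZE`, `emulate_group_sum`, and `emulate_partial_sums` are illustrative names, the group size is assumed to be a power of two, and the work items of a group are run one after another instead of in parallel. In a real kernel the scratch array would be declared `__local`, and each point marked "barrier" would be a call to `barrier(CLK_LOCAL_MEM_FENCE)`.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE 4 /* assumed work-group size; must be a power of two here */

/* Emulates one work group summing LOCAL_SIZE inputs through a scratch array
 * that stands in for __local memory. */
int emulate_group_sum(const int *input, size_t group_id)
{
    int scratch[LOCAL_SIZE];
    size_t lid, stride;

    /* Phase 1: every work item copies one element from global to local memory. */
    for (lid = 0; lid < LOCAL_SIZE; ++lid)
        scratch[lid] = input[group_id * LOCAL_SIZE + lid];
    /* barrier: all copies must be visible before anyone reads a neighbour's slot */

    /* Phase 2: tree reduction, halving the number of active work items each step. */
    for (stride = LOCAL_SIZE / 2; stride > 0; stride /= 2) {
        for (lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* barrier: finish this step before the next one reads its results */
    }

    /* Work item 0 would write this value back to global memory. */
    return scratch[0];
}

/* Emulates the whole launch: one partial sum per work group. Groups never
 * read each other's scratch array, mirroring how local memory is private
 * to a group. */
void emulate_partial_sums(const int *input, int *partial, size_t global_size)
{
    size_t group;
    assert(global_size % LOCAL_SIZE == 0);
    for (group = 0; group < global_size / LOCAL_SIZE; ++group)
        partial[group] = emulate_group_sum(input, group);
}
```

For the input 1, 2, …, 8 with a global size of 8, this produces two partial sums, 10 and 26, one per group. Combining them into a single total has to go through global memory, because the two groups cannot see each other's scratch data or wait for one another.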

But:

- Local memory is small (typically a few tens of KiB per compute unit), and using all of it in one group usually has further efficiency penalties.
- Local memory must be filled with data inside the kernel, and its contents are lost when the kernel completes (except for device-scheduled kernels in OpenCL 2).
- There are tight limits on group size due to hardware limitations.

So if you are not using local memory, work groups and therefore local work size are essentially irrelevant to you.
