
After reading a lot of definitions regarding global work size and local work size, I still don't really understand what they are and how they work. I think that the global work size determines how many times the kernel function will be called, but what about the local work size?

I thought that the local work size determines how many threads are going to run at the same time in parallel, but am I really correct?

Is the local size the number of threads executing one kernel program per one global size value? I mean, when we have global size = 1 and local size = 1, the kernel function will be called once and only one thread will work on it. But when we have global size = 4096 and local size (if allowed that high) = 1024, do we have 4096 calls of the kernel function, each with 1024 threads working on it at the same time? Am I correct?

Here is some example code I found (it was posted as an image, which is missing here):
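It was presumably a simple vector addition working only on global ids, roughly along these lines (a sketch, not the original image; the answer below refers to c[i] = a[i] + b[i]):

    // Rough sketch of the kernel from the missing image (hypothetical):
    // each work-item adds one element and is indexed only by its global id.
    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* c)
    {
        int i = get_global_id(0);   // global index; no local id used
        c[i] = a[i] + b[i];
    }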

And my other question is: how does changing the local size influence that code? As I see it, the code clearly works on global ids, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?

And if we had a for loop in that algorithm, would that change anything regarding the influence of the local size? Do we need to use local ids to see any difference when changing the local size?

I tested this on a few of my programs, and even when I used only global ids, changing the local work size gave me significantly shorter execution times. So how does it work? I don't get it.

Thank you in advance!


1 Answer


I thought that the local work size determines how many threads are going to run at the same time in parallel, but am I really correct?

Correct, but it is per compute unit, not the whole device. If there are more compute units than work-groups (local thread groups), the device is not fully used. When there are more thread groups than compute units but not an exact multiple, some compute units wait for the others at the end. When the two counts are equal (or an exact multiple), then the "how many times" part is what matters for fully occupying all ALUs.

For example, an 8-core CPU could expose 8 compute units (maybe +8 more with hardware multithreading). But a GPU at a similar price can have 20 to 64 compute units. Then, even within a single compute unit, many groups of threads can be "in flight"; this is not explicitly tuned but depends on the resource usage per thread, per compute unit, and maybe per GPU.
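If you want to see these numbers for your own device, you can query them from the host (a minimal sketch; error checking omitted, and the cl_device_id is assumed to come from an earlier clGetDeviceIDs call):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Print how many compute units a device has and how big one
       work-group is allowed to be on it. */
    void print_device_parallelism(cl_device_id device)
    {
        cl_uint compute_units = 0;
        size_t  max_wg_size   = 0;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        printf("compute units: %u, max work-group size: %zu\n",
               (unsigned)compute_units, max_wg_size);
    }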

How does changing the local size influence that code? As I see it, the code clearly works on global ids, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?

Vectorizable/parallelizable kernel code can benefit from distributing threads to the ALUs and SIMD units of a CPU core, or the wider SIMD units of a GPU compute unit. For a CPU, 8 scalar instructions could be issued at the same time; for a GPU, it could be as many as thousands. So when you decrease the local size to 1, you limit the width of parallel thread issue to 1 ALU, which cripples performance on many architectures. When you make the local size too big, the resources per thread shrink and performance takes a hit. If you have no idea what to pick, the OpenCL API can tune the local size for you if you pass NULL for that parameter.
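On the host side, letting the runtime pick the local size looks like this (a sketch; the queue and kernel are assumed to be already created, and the global size of 4096 is just an example):

    #include <CL/cl.h>

    /* Enqueue the kernel over 4096 work-items and let the runtime choose
       the work-group size by passing NULL for local_work_size. */
    cl_int enqueue_with_auto_local_size(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_size = 4096;
        return clEnqueueNDRangeKernel(queue, kernel,
                                      1,            /* one-dimensional NDRange   */
                                      NULL,         /* no global offset          */
                                      &global_size, /* global work size          */
                                      NULL,         /* local size: runtime picks */
                                      0, NULL, NULL);
    }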

And if we had a for loop in that algorithm, would that change anything regarding the influence of the local size? Do we need to use local ids to see any difference when changing the local size?

For old, statically scheduled architectures, loop unrolling is advised, with an unroll step equal to the basic SIMD width. No, the local id is just a query of a thread's id within its work-group, so there is no need to query it if you don't need it.
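As an illustration of such unrolling (a hypothetical kernel, not the one from the question; it assumes the array length is a multiple of 4 and the global size is divided by 4 accordingly):

    // Each work-item handles 4 consecutive elements, so the NDRange is
    // launched with global size = N / 4 (N assumed to be a multiple of 4).
    __kernel void vec_add_unrolled(__global const float* a,
                                   __global const float* b,
                                   __global float* c)
    {
        int i = get_global_id(0) * 4;
        // Unrolled by 4 so a statically scheduled core can pack the
        // independent additions into its SIMD lanes.
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }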

I tested this on a few of my programs, and even when I used only global ids, changing the local work size gave me significantly shorter execution times. So how does it work?

If the kernel needs an insane amount of resources, you could consider 1 thread per work-group. If the kernel doesn't need any resources except immediate values, you should make the local size as large as possible. Resource allocation per thread (determined by the kernel code) is important. New architectures have load balancing, so it may not matter in the future if you let the API choose the optimum value.
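You can also ask the runtime how large a work-group a given kernel can actually use on a device, which already takes its resource usage into account (a sketch; the kernel and device handles are assumed to exist):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query the per-kernel work-group size limit and the preferred
       work-group size multiple for a given device. */
    void print_kernel_wg_limits(cl_kernel kernel, cl_device_id device)
    {
        size_t max_wg = 0, preferred_multiple = 0;

        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg), &max_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple), &preferred_multiple, NULL);

        printf("max work-group size for this kernel: %zu (preferred multiple: %zu)\n",
               max_wg, preferred_multiple);
    }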

To keep all ALUs busy, the scheduler issues many threads per core; when one thread is waiting for a memory operation, another thread can do an ALU operation at the same time. This is good when resource usage is small. When you use 50% of all the resources of a compute unit, it can have only 2 threads in flight. Threads share shareable resources such as the L1 cache, local memory, and the register file.

Code such as c[i] = a[i] + b[i] on scalar floats is vectorizable. You can get better performance using float8, float16, and similar types if the compiler is not already doing it in the background. This way fewer threads are needed to accomplish all the work, and memory accesses are faster. You can also add a loop inside the kernel to decrease the number of threads even more, which is good for a CPU since less thread dispatching is needed between two data blocks. For a GPU, it may not matter.
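A float8 version of the same kernel could look like this (a sketch, assuming the array length is a multiple of 8 and the global size is divided by 8):

    // Same c = a + b, but each work-item handles 8 floats at once.
    // Launch with global size = N / 8 (N assumed to be a multiple of 8).
    __kernel void vec_add8(__global const float* a,
                           __global const float* b,
                           __global float* c)
    {
        int i = get_global_id(0);
        float8 va = vload8(i, a);   // loads a[8*i .. 8*i+7]
        float8 vb = vload8(i, b);
        vstore8(va + vb, i, c);     // stores into c[8*i .. 8*i+7]
    }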


Trivial example for a CPU:

4 cores, local size = 10, global size = 100

Cores 1 and 2 get 3 thread groups each; cores 3 and 4 get only 2 each.

  • Core 1: 30 threads --> fully busy
  • Core 2: 30 threads
  • Core 3: 20 threads --> less busy, better preemption for other jobs
  • Core 4: 20 threads

While instruction pipelining doesn't have many bubbles for cores 1 and 2, bubbles start after some time for cores 3 and 4, so those cores can be used for other jobs such as a second kernel running in parallel, the operating system, or some array copying. When you use all cores equally, for example with 120 threads, they finish more work per second, but the CPU cannot do array copies if the kernels are already using the memory (unless the OS preempts them for other threads).