13
votes

I know that work items are grouped into work groups, and that you cannot synchronize outside of a work group.

Does that mean that work items are executed in parallel?

If so, is it possible/efficient to make 1 work group with 128 work items?

5 Answers

12
votes

The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.

On my AMD card, the 'compute units' are divided into 16 4-wide SIMD units. This means that 16 work items can technically be run at the same time in the group. It is recommended that we use multiples of 64 work items in a group to hide memory latency. Clearly they cannot all be run at the exact same time. This is not a problem, because most kernels are, in fact, memory bound, so the scheduler (hardware) will swap out the work items waiting on the memory controller while the 'ready' items get their compute time.

The actual number of work items in a group is set by the host program and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment to find the optimal work-group size for your kernel.
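As a rough illustration (my own sketch, not part of the original answer): the work-group size is whatever you pass as the local work size when enqueuing the kernel. The function below assumes a cl_command_queue and a cl_kernel that were created elsewhere, and the sizes are placeholders to experiment with.

    #include <CL/cl.h>

    /* Sketch: the local_work_size argument of clEnqueueNDRangeKernel fixes the
     * number of work items per group. global_size must be a multiple of
     * local_size, and local_size may not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE
     * (nor the kernel-specific limit). */
    cl_int enqueue_with_group_size(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_size = 1024;  /* total number of work items          */
        size_t local_size  = 64;    /* work items per group -- experiment  */

        return clEnqueueNDRangeKernel(queue, kernel,
                                      1,            /* 1D NDRange          */
                                      NULL,         /* no global offset    */
                                      &global_size,
                                      &local_size,  /* work-group size     */
                                      0, NULL, NULL);
    }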

The CPU implementation is 'worse' when it comes to simultaneous work items: there are only ever as many work items running as you have cores available to run them on, so execution behaves more sequentially on the CPU.

So do work items run at exactly the same time? Almost never, really. This is why we need to use barriers when we want to be sure they all pause at a given point.
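To make that concrete, here is a small made-up OpenCL C kernel (not from the original answer) where a barrier is required before a work item may read a value written by its neighbour:

    /* Each work item copies one element into local memory, then reads its
     * neighbour's slot. The barrier guarantees every item in the group has
     * finished writing before anyone reads; without it, the neighbour's slot
     * might still hold garbage, because the items do not run in lockstep. */
    __kernel void shift_left(__global const float *in,
                             __global float *out,
                             __local  float *tmp)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        tmp[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);       /* all items in the group pause here */

        out[gid] = tmp[(lid + 1) % lsz];    /* read neighbour, wrapping in the group */
    }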

6
votes

In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.

Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is the max, but it can be smaller for large kernels using a lot of registers). All work groups are then scheduled on the (usually 2 to 16) cores of the GPU.

You can synchronize threads (work items) inside a work group, because they are all resident on the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time and could be executed on different cores.
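A sketch of what that restriction means in practice (my own example, assuming the work-group size is a power of two): within one kernel launch, a reduction can only go as far as one partial result per work group; combining the per-group results needs a second launch or the host, since there is no barrier across work groups.

    /* Each work group sums its chunk in local memory and writes a single
     * partial sum. Cross-group combination must happen in a later step. */
    __kernel void partial_sum(__global const float *in,
                              __global float *group_sums,
                              __local  float *scratch)
    {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);  /* legal: every item in the group reaches it */
        }

        if (lid == 0)
            group_sums[get_group_id(0)] = scratch[0];
    }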

Yes, it is possible to have 128 work items inside a work group, unless the kernel consumes too many resources (registers or local memory). To reach maximum performance, you usually want to have the largest possible number of threads in a work group (at least 64 are required to hide memory latency; see Vasily Volkov's presentations on this subject).
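Whether 128 actually fits for a particular kernel can be checked at run time. A hedged sketch (the kernel and device objects are assumed to exist already): clGetKernelWorkGroupInfo reports the per-kernel limit, which accounts for the registers and local memory the compiled kernel uses and can therefore be lower than the device-wide maximum.

    #include <stdio.h>
    #include <CL/cl.h>

    void print_kernel_group_limit(cl_kernel kernel, cl_device_id device)
    {
        size_t max_group = 0;
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_group), &max_group, NULL);
        printf("This kernel supports up to %zu work items per group\n",
               max_group);
    }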

1
votes

The idea is that they can be executed in parallel if possible (whether they actually will be executed in parallel depends on the hardware and the implementation).

0
votes

Yes, work items are executed in parallel.

To get the maximum possible number of work items per work group, use clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE. It depends on the hardware.
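For example, a minimal query (a sketch that takes the first device of the first platform and omits error checking):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        size_t max_group_size = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_group_size), &max_group_size, NULL);

        printf("CL_DEVICE_MAX_WORK_GROUP_SIZE = %zu\n", max_group_size);
        return 0;
    }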

Whether it's efficient or not primarily depends on the task you want to implement. If you need a lot of synchronization, it may be that OpenCL does not fit your task. I can't say much more without knowing what you actually want to do.

0
votes

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.