I'm confused about OpenCL and work items. Let's say my device can run 128 work items simultaneously. However, I provide two work groups, each with 64 work items. Will both groups execute simultaneously, or will 64 threads sit idle while the groups are executed serially?
If you enqueue a single kernel with a global size of 128x1x1 and a local size of 64x1x1, there will be two work groups, and they can run at the same time. Each group can be executed on a separate compute unit, so if your hardware has at least two compute units, both groups can run in parallel. Note that OpenCL does not guarantee that different work groups run concurrently; how they are scheduled is up to the runtime and the hardware.
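To make the arithmetic concrete, here is a minimal toy model, assuming (as a simplification) that each compute unit holds a fixed number of resident work groups at a time. The function names and the `groups_per_cu` parameter are hypothetical and illustrative only; in a real host program the compute-unit count would come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`, and real GPUs often fit several work groups on one compute unit.

```python
import math


def work_group_count(global_size, local_size):
    """Number of work groups in a 1-D NDRange.

    OpenCL 1.x requires the global size to be a multiple of the local size;
    this sketch keeps that rule for simplicity.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    return global_size // local_size


def scheduling_rounds(num_groups, compute_units, groups_per_cu=1):
    """Toy model: how many 'waves' are needed if each compute unit can hold
    at most groups_per_cu resident work groups at once (hypothetical limit)."""
    slots = compute_units * groups_per_cu
    return math.ceil(num_groups / slots)


groups = work_group_count(128, 64)                           # 2 work groups
print(groups, scheduling_rounds(groups, compute_units=2))    # both run in one wave
print(scheduling_rounds(groups, compute_units=1))            # one at a time
print(scheduling_rounds(groups, compute_units=1, groups_per_cu=2))
```

In this model the 128/64 launch finishes in a single wave on two compute units, and needs two waves on one compute unit only if that unit cannot host both groups at once.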
If your local size is larger than the number of processing elements in a compute unit, each work group will be split into subgroups (called warps on NVIDIA hardware and wavefronts on AMD hardware). These subgroups will be executed "serially". (A local size above the device's CL_DEVICE_MAX_WORK_GROUP_SIZE is rejected outright with CL_INVALID_WORK_GROUP_SIZE, so this only applies below that limit.) Note that "serially" isn't necessarily the best way to describe the execution, because in reality the hardware switches between subgroups. One subgroup may begin working and make a memory request, and the compute unit then switches to another subgroup so that it can run while the first one waits. Because this kind of switching is very cheap on a GPU, it is an effective way of hiding some of the latency of global memory accesses.
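The latency-hiding idea can be illustrated with a deliberately simplified model, not a description of how any particular GPU schedules work. It assumes one shared execution unit per compute unit, memory requests that overlap freely, and made-up cycle counts; `subgroup_width`, `compute_cycles`, `memory_latency` and `iterations` are hypothetical parameters. Each subgroup repeatedly computes and then waits on memory; the "serial" version runs one subgroup to completion before starting the next, while the "interleaved" version lets another subgroup compute during the wait.

```python
import math


def subgroup_count(local_size, subgroup_width):
    """How many subgroups a work group is split into (width is hypothetical,
    e.g. 32 for an NVIDIA warp or 64 for an AMD GCN wavefront)."""
    return math.ceil(local_size / subgroup_width)


def serial_cycles(num_subgroups, compute_cycles, memory_latency, iterations=1):
    """Toy model: each subgroup finishes all its compute+wait phases before the
    next subgroup starts, so no memory latency is hidden."""
    return num_subgroups * iterations * (compute_cycles + memory_latency)


def interleaved_cycles(num_subgroups, compute_cycles, memory_latency, iterations=1):
    """Toy model: one shared execution unit; while a subgroup waits on memory,
    the scheduler issues compute for whichever subgroup is ready earliest."""
    ready_at = [0] * num_subgroups       # cycle when each subgroup may compute again
    remaining = [iterations] * num_subgroups
    unit_free = 0                        # cycle when the shared execution unit is free
    finish = 0
    while any(remaining):
        # Pick the unfinished subgroup that becomes ready earliest (ties: lowest index).
        i = min((j for j in range(num_subgroups) if remaining[j]),
                key=lambda j: ready_at[j])
        start = max(unit_free, ready_at[i])
        unit_free = start + compute_cycles           # compute phase occupies the unit
        ready_at[i] = unit_free + memory_latency     # then the subgroup waits on memory
        remaining[i] -= 1
        finish = max(finish, ready_at[i])
    return finish


n = subgroup_count(64, 32)                           # 2 subgroups
print(serial_cycles(n, 4, 20), interleaved_cycles(n, 4, 20))
```

With two subgroups, 4 compute cycles and a 20-cycle memory wait, the serial model takes 48 cycles and the interleaved one 28, because the second subgroup computes while the first is waiting. With a single subgroup both models give the same result, since there is nothing to switch to.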