I was reading some results and saw that they used 5120 work-groups with a local size of 1. I have limited knowledge of OpenCL and was wondering whether this statement is correct:

As can be seen for the GPU, the first test has 5120 work-groups with 1 work-item each. This means that the number of threads executed in parallel is limited to the number of compute units in the machine. For example, if a GPU has 20 compute units, at most 20 threads can be working in parallel. When the local size is increased to 2, however, twice as many threads run simultaneously.

From reading some info on OpenCL, it seems about right, though I need a second opinion.

1 Answer

Update: Hmm, nat chouf's comment is right. I understood the question as "in flight at the same time" rather than "physically executed at the same time".

As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
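To make the resource limit concrete, here is a rough Python sketch of how one might estimate the number of work-groups that can be in flight on one compute unit. Everything in it is an assumption for illustration: the function name `resident_work_groups`, the parameters, and the numbers are hypothetical and not taken from any particular GPU. Real hardware also allocates registers at warp/wavefront granularity, which this sketch ignores.

```python
def resident_work_groups(local_mem_per_cu, regs_per_cu, max_groups_per_cu,
                         local_mem_per_group, regs_per_item, group_size):
    """Rough upper bound on work-groups in flight on one compute unit.

    Simplified model: each resource imposes its own limit and the
    smallest limit wins. All parameters are hypothetical inputs.
    """
    # Hard scheduler limit on resident work-groups (assumed to exist).
    limits = [max_groups_per_cu]

    # Local memory is allocated per work-group.
    if local_mem_per_group > 0:
        limits.append(local_mem_per_cu // local_mem_per_group)

    # Registers are allocated per work-item, so a group needs
    # regs_per_item * group_size of them in this simplified model.
    regs_per_group = regs_per_item * group_size
    if regs_per_group > 0:
        limits.append(regs_per_cu // regs_per_group)

    return min(limits)


# Made-up compute unit: 64 KiB local memory, 65536 registers,
# at most 16 resident work-groups.
print(resident_work_groups(
    local_mem_per_cu=65536, regs_per_cu=65536, max_groups_per_cu=16,
    local_mem_per_group=8192, regs_per_item=32, group_size=256))  # prints 8
```

With a work-group size of 1 and no local memory, the same made-up numbers give 16: the per-compute-unit cap on resident work-groups becomes the limit, not registers or local memory.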

In existing implementations (AFAIK), a compute unit picks one block (warp/wavefront) of work-items, all from the same work-group, among all the blocks in flight on that compute unit. One "instruction" of this block is inserted into the pipeline (this may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.

So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.
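The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model, not a description of any specific GPU: it assumes one block is issued per compute unit at a time, as described above, and that a block never mixes work-items from different work-groups. The block width of 32 and the helper names are hypothetical; the 20 compute units come from the example in the question.

```python
import math


def lane_utilization(group_size, block_width):
    """Fraction of hardware lanes doing useful work for one work-group.

    A work-group is padded up to a whole number of blocks
    (warps/wavefronts); lanes beyond group_size stay idle.
    """
    blocks = math.ceil(group_size / block_width)
    return group_size / (blocks * block_width)


def items_issued_together(group_size, block_width, compute_units):
    """Work-items physically started at once across the GPU, assuming
    each compute unit issues one block at a time (simplified model)."""
    return min(group_size, block_width) * compute_units


BLOCK_WIDTH = 32     # hypothetical warp/wavefront width
COMPUTE_UNITS = 20   # the figure used in the question

for local_size in (1, 2, 32, 64):
    print(local_size,
          items_issued_together(local_size, BLOCK_WIDTH, COMPUTE_UNITS),
          lane_utilization(local_size, BLOCK_WIDTH))
```

Under these assumptions a local size of 1 starts 20 work-items at once and uses 1/32 of the lanes, a local size of 2 starts 40, and the count stops growing once the local size reaches the block width.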