1) How exactly are for-loops executed in OpenCL? I know that all
work-items run the same code and that work-items within a work group
try to execute in parallel. So if I run a for loop in OpenCL, does
that mean all work-items run the same loop, or is the loop somehow
divided up to run across multiple work items, with each work item
executing a part of the loop (i.e., work item 1 processes indices 0 ~ 9,
item 2 processes indices 10 ~ 19, etc.)?
You are right: all work items run the same code, but note that they may not run it at the same pace. They run the same code only logically. In hardware, the work items inside the same wavefront (AMD term) or warp (NVIDIA term) execute in lockstep at the instruction level.
At the assembly level, a loop is nothing more than a few branch instructions. Threads in the same wave execute each branch instruction in parallel. If all work items evaluate the branch condition the same way, they follow the same path and keep running in parallel. If they disagree on the condition, execution typically diverges. For example, in the code below:
if (condition)
    do_a();
else
    do_b();
logically, the work items that meet the condition execute do_a(), while the others execute do_b(). In reality, however, the work items in a wave step through the hardware in lockstep, so it is impossible for them to run different code in parallel. Instead, while the wave executes do_a(), the work items that took the else path are masked out; when do_a() finishes, the wave moves on to do_b(), and now the work items that took the if path are masked out. During either branch body, only part of the work items are active.
Back to the loop question: since a loop is a branch operation, if the loop condition is true for only some work items, the situation above occurs, with some work items executing the loop body while the others are masked out. However, in your code:
for (int jb = 0; jb < nb; jb++) {
    pblock[ti] = pos_old[jb*nt+ti];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int j = 0; j < nt; j++) {
the loop condition does not depend on the work item IDs, which means all work items share exactly the same loop condition, so they follow the same execution path and run in parallel the whole time.
2) In this code snippet, how do the outer and inner loops execute?
Does OpenCL know that the outer loop is dividing the work among all
the work groups and that the inner loop is trying to divide the work
among work-items within each work group?
As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, the work items always run in parallel.
As for workload distribution in OpenCL, it is entirely up to the developer to specify how the work is divided. OpenCL knows nothing about how to divide the workload among work groups and work items; you partition it yourself by assigning different data and operations based on the global or local work-item ID. For example:
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code makes each work item fetch two values from consecutive memory locations and store the result of the computation into consecutive memory.
3) If the inner loop is divided among the work-items (meaning that the
code within the for loop is executed in parallel, or at least
attempted to), how does the addition at the end work? It is
essentially doing a = a + f*d, and from my understanding of pipelined
processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are defined in the kernel code without an address-space qualifier, which means they are private to each work item. On a GPU, these variables are first assigned to registers; registers, however, are a very limited resource, so when they run out the variables are moved into private memory, which is called register spilling. Depending on the hardware, spilling is implemented in different ways; on some platforms private memory is backed by global memory, so any register spilling causes severe performance degradation.
Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.