1) How exactly are for-loops executed in OpenCL? I know that all
work-items run the same code and that work-items within a work group
try to execute in parallel. So if I run a for loop in OpenCL, does
that mean all work-items run the same loop, or is the loop somehow
divided up to run across multiple work items, with each work item
executing a part of the loop (i.e., work item 1 processes indices 0 ~ 9,
item 2 processes indices 10 ~ 19, etc.)?
You are right: all work items run the same code, but note that they may not run it at the same pace. They run the same code only logically. In hardware, the work items inside the same wavefront (AMD term) or warp (NVIDIA term) execute in lockstep at the instruction level.
At the assembly level, a loop is nothing more than a few branch instructions. Threads in the same wave execute each branch instruction in parallel. If all work items evaluate the branch condition the same way, they follow the same path and keep running in parallel. If they disagree on the condition, execution typically diverges. For example, in the code below:
if (condition)
    do_a();
else
    do_b();
logically, the work items that meet the condition execute do_a(), while the others execute do_b(). In reality, however, the work items in a wave step through the hardware in lockstep, so it is impossible for them to run different code in parallel. Instead, while the wave executes do_a(), the work items that took the else path are masked out; when do_a() finishes, the wave moves on to do_b(), and now the work items that took the if path are masked out. During either branch body, only part of the work items are active.
Back to the loop question: since a loop is a branch operation, if the loop condition is true for only some work items, the situation above occurs, with some work items executing the loop body while the others are masked out. However, in your code:
for (int jb = 0; jb < nb; jb++) {
    pblock[ti] = pos_old[jb*nt+ti];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int j = 0; j < nt; j++) {
the loop condition does not depend on the work item IDs, which means all work items share exactly the same loop condition, so they follow the same execution path and run in parallel the whole time.
2) In this code snippet, how do the outer and inner loops execute?
Does OpenCL know that the outer loop is dividing the work among all
the work groups and that the inner loop is trying to divide the work
among work-items within each work group?
As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, the work items always run in parallel.
As for workload distribution in OpenCL, it is entirely up to the developer to specify how the work is divided. OpenCL knows nothing about how to divide the workload among work groups and work items; you partition it yourself by assigning different data and operations based on the global or local work-item ID. For example:
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code makes each work item fetch two values from consecutive memory locations and store the result of the computation into consecutive memory.
3) If the inner loop is divided among the work-items (meaning that the
code within the for loop is executed in parallel, or at least
attempted to), how does the addition at the end work? It is
essentially doing a = a + f*d, and from my understanding of pipelined
processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are defined in the kernel code without an address-space qualifier, which means they are private to each work item. On a GPU, these variables are first assigned to registers; registers, however, are a very limited resource, so when they run out the variables are moved into private memory, which is called register spilling. Depending on the hardware, spilling is implemented in different ways; on some platforms private memory is backed by global memory, so any register spilling causes severe performance degradation.
Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.