I am studying OpenCL code that simulates the N-body problem, taken from the following tutorial:

http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html

My main question concerns this part of the kernel code:

    for(int jb=0; jb < nb; jb++) { /* Foreach block ... */

        pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
        barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */

        for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
            float4 p2 = pblock[j]; /* Read a cached particle position */
            float4 d = p2 - p;
            float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
            float f = p2.w*invr*invr*invr;
            a += f*d; /* Accumulate acceleration */
        }

        barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
    }

I don't understand exactly what happens at execution: the kernel code is executed n times, where n is the number of work-items (which is also the number of threads), but the part of the code above uses local memory, which belongs to each work-group (there seem to be nb work-groups).

So, at execution, up to the first barrier, am I filling the local pblock array with the global values from pos_old?

Still up to the first barrier: for another work-group, will its pblock array contain the same values as the arrays of the other work-groups, since jb=0 before the barrier?

It seems to be a way of sharing these arrays among all the work-groups, but this is not entirely clear to me.

Any help is welcome.

1 Answer

Can you post the entire kernel code please? I have to make assumptions about the params and private variables.

It looks like there are nt work-items in each work-group, and ti is the local index of the current work-item. On each iteration of the outer loop, every work-item in the group copies a single element, here from the global array pos_old, into the group's local array pblock. The first barrier makes each work-item wait until all the other work-items in the group have made their copy. This is necessary because every work-item in the group then reads the data copied by every other work-item. The values copied by different work-items within a group are not the same, because ti is different for each work-item (jb*nt is still zero on the first iteration, though).
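To make the cooperative load concrete, here is a minimal host-side sketch in plain C (not OpenCL) that models what one work-group does before the first barrier. The function name `load_tile` and the use of a single `float` per particle instead of a `float4` are assumptions made for illustration; the sequential loop stands in for the nt work-items that run concurrently on the device.

```c
#include <stddef.h>

/* Hypothetical model of one work-group's load for outer-loop iteration jb.
 * pos_old : global array holding nb*nt values
 * pblock  : this work-group's local tile (nt values)
 * Each pass of the loop plays the role of one work-item ti; returning
 * from the function corresponds to passing the first barrier. */
void load_tile(const float *pos_old, float *pblock, size_t nt, size_t jb)
{
    for (size_t ti = 0; ti < nt; ti++) {
        pblock[ti] = pos_old[jb * nt + ti]; /* one element per work-item */
    }
}
```

Because the source index `jb*nt + ti` does not depend on the work-group ID, calling this for two different groups with the same `jb` fills both tiles with the same values; each group simply has its own copy in its own local memory.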


Here is the entire kernel code:

__kernel
void 
nbody_sim(
    __global float4* pos ,
    __global float4* vel,
    int numBodies,
    float deltaTime,
    float epsSqr,
    __local float4* localPos,
    __global float4* newPosition,
    __global float4* newVelocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Number of tiles we need to iterate
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    float4 myPos = pos[gid];
    float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            float4 r = localPos[j] - myPos;
            float distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;
            float invDist = 1.0f / sqrt(distSqr + epsSqr);
            float invDistCube = invDist * invDist * invDist;
            float s = localPos[j].w * invDistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    float4 oldVel = vel[gid];

    // updated position and velocity
    float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;
    float4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}
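
A quick way to see which bodies the tile loop in this kernel actually reads is to replay its two nested loops on the host. The following is a sketch in plain C, not part of the original sample; `count_tiled_reads` is a hypothetical helper that mirrors `numTiles = numBodies / localSize` and the index `i * localSize + j`.

```c
/* Hypothetical helper: replays the kernel's tile loops and records how
 * many times each body index is read by a single work-item.
 * reads must have room for num_bodies ints. */
void count_tiled_reads(unsigned int num_bodies, unsigned int local_size,
                       int *reads)
{
    unsigned int num_tiles = num_bodies / local_size; /* integer division */

    for (unsigned int k = 0; k < num_bodies; k++) {
        reads[k] = 0;
    }
    for (unsigned int i = 0; i < num_tiles; i++) {       /* one tile per pass */
        for (unsigned int j = 0; j < local_size; j++) {  /* bodies in the tile */
            reads[i * local_size + j] += 1;
        }
    }
}
```

With 100 bodies and a local size of 25, every index is read exactly once. If the body count were not a multiple of the local size (say 100 and 30), the integer division would leave indices 90 to 99 unread, so the kernel as written appears to assume that `numBodies` is a multiple of `localSize`.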

There are "numTiles" work-groups, each with "localSize" work-items (assuming the global size equals "numBodies").

"gid" is the global index and "tid" is the local index.

Let's start at the first iteration of the loop "for(int i = 0; i < numTiles; ++i)" with "i=0":

Take for example:

numTiles = 4, localSize = 25 and numBodies = 100 = number of work-items.

Then, at execution, if gid = 80, then tid = 5 and idx = 5, so the first assignment is: localPos[5] = pos[5]

Now take gid = 5: then tid = 5 and idx = 5, and I get the same assignment: localPos[5] = pos[5]

So, from what I understand, in the first iteration and after the first "barrier", each work-item sees the same contents in its work-group's local array "localPos", i.e. the first block of the global array, "pos[0:24]".

Is this a correct explanation of what happens?
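
The index arithmetic in this example can be checked with a few lines of plain C. This is only a sketch: the helper names are hypothetical, and it assumes a one-dimensional NDRange with no global offset, in which case `get_local_id(0)` equals `gid % localSize` and `get_group_id(0)` equals `gid / localSize`.

```c
/* Hypothetical host-side equivalents of the kernel's index computations,
 * assuming a 1-D NDRange with no global offset. */
unsigned int local_id(unsigned int gid, unsigned int local_size)
{
    return gid % local_size;          /* models get_local_id(0) */
}

unsigned int group_id(unsigned int gid, unsigned int local_size)
{
    return gid / local_size;          /* models get_group_id(0) */
}

/* Global index loaded by work-item tid when processing tile i. */
unsigned int tile_index(unsigned int i, unsigned int local_size,
                        unsigned int tid)
{
    return i * local_size + tid;
}
```

For gid = 80 and gid = 5 with localSize = 25, both work-items have tid = 5 and load pos[5] on tile 0, but they belong to groups 3 and 0 respectively, so each writes into its own group's "localPos".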