3
votes

When I read the n-body code in the CUDA SDK, I went through some of the lines and found that it differs slightly from the paper in GPU Gems 3, "Fast N-Body Simulation with CUDA".

My questions are: First, why is blockIdx.x still involved in loading from global memory into shared memory, as in the following code?

for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
    sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : //this line
        positions[WRAP(blockIdx.x + tile,                   gridDim.x) * p + threadIdx.x];  //this line

    __syncthreads();

    // This is the "tile_calculation" function from the GPUG3 article.
    acc = gravitation(bodyPos, acc);

    __syncthreads();
}

Isn't it supposed to look like the following, according to the paper? I wonder why it doesn't.

    sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
        positions[WRAP(tile,                   gridDim.x) * p + threadIdx.x];

Second, in the multiple-threads-per-body case, why is threadIdx.x still involved? Isn't it supposed to be a fixed value, or not involved at all, since the sum is only over threadIdx.y?

if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; //this line
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; //this line
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; //this line

    __syncthreads();

    // Save the result in global memory for the integration step
    if (threadIdx.y == 0)
    {
        for (int i = 1; i < blockDim.y; i++)
        {
            acc.x += SX_SUM(threadIdx.x,i).x; //this line
            acc.y += SX_SUM(threadIdx.x,i).y; //this line
            acc.z += SX_SUM(threadIdx.x,i).z; //this line
        }
    }
}

Can anyone explain this to me? Is it some kind of optimization for faster code?


2 Answers

5
votes

I am an author of this code and the paper. Numbered answers correspond to your numbered questions.

  1. The blockIdx.x offset to the WRAP macro is not mentioned in the paper because this is a micro-optimization. I'm not even sure it is worthwhile any more. The purpose was to ensure that different SMs were accessing different DRAM memory banks rather than all pounding on the same bank at the same time, to ensure we maximize the memory throughput during these loads. Without the blockIdx.x offset, all simultaneously running thread blocks will access the same address at the same time. Since the overall algorithm is compute rather than bandwidth bound, this is definitely a minor optimization. Sadly, it makes the code more confusing.
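The effect of that offset can be sketched with a host-side simulation (plain Python, an illustration only, not the SDK code) of the tile index each thread block reads on a given loop iteration:

```python
def wrap(x, n):
    """Same idea as the SDK's WRAP macro: fold x into [0, n) for 0 <= x < 2n."""
    return x - n if x >= n else x

def tile_read(block_idx_x, tile, grid_dim_x):
    """Row of 'positions' a block loads: WRAP(blockIdx.x + tile, gridDim.x)."""
    return wrap(block_idx_x + tile, grid_dim_x)

grid_dim_x = 4  # number of tiles == number of blocks in x (small, for illustration)

for it in range(grid_dim_x):
    tiles = [tile_read(b, it, grid_dim_x) for b in range(grid_dim_x)]
    # With the blockIdx.x offset, concurrently running blocks read a
    # permutation of the tiles on each iteration, instead of all blocks
    # reading tile 'it' (and thus the same DRAM addresses) at once.
    assert sorted(tiles) == list(range(grid_dim_x))
    print(f"iteration {it}: block -> tile {tiles}")
```

Every block still visits every tile over the whole loop; only the order is rotated per block, which is why the result is unchanged.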

  2. The sum is across threadIdx.y, as you say, but each thread needs to do a separate sum (each thread computes gravitation for a separate body). Therefore we need to use threadIdx.x to index the right column of the (conceptually 2D) shared memory array.
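A small host-side sketch (Python, an illustration rather than the SDK code) of why threadIdx.x must index the shared array: each x column holds the partials for a different body, and the reduction in the kernel sums only its own column:

```python
# Simulated block of p x-threads (one body each) and q y-threads
# (each y-thread holds one partial acceleration for its body).
p, q = 4, 3

# sx_sum[x][y]: partial computed by thread (x, y); stand-in values here.
sx_sum = [[(x + 1) * 10 + y for y in range(q)] for x in range(p)]

# The thread with threadIdx.y == 0 in column x starts from its own
# partial (i = 0 is already in acc) and adds partials i = 1 .. q-1
# of the SAME column, mirroring the loop over blockDim.y in the kernel:
totals = []
for x in range(p):          # one body per threadIdx.x
    acc = sx_sum[x][0]
    for i in range(1, q):
        acc += sx_sum[x][i]
    totals.append(acc)

# Each column reduces independently; dropping threadIdx.x would make
# every body sum the same (wrong) column.
assert totals == [sum(col) for col in sx_sum]
print(totals)  # -> [33, 63, 93, 123]
```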

To answer SystmD's question in his (not really correct) answer: gridDim.y is only 1 in the (default/common) 1D grid case.

0
votes

1) The array sharedPos is loaded into the shared memory of each block (i.e., each tile) before the threads of each block synchronize (with __syncthreads()). blockIdx.x is the index of the tile, according to the algorithm.

Each thread (with indices threadIdx.x, threadIdx.y) loads one part of the shared array sharedPos; blockIdx.x refers to the index of the tile (without multithreading).

2) acc is the float3 acceleration of the body with index blockIdx.x * blockDim.x + threadIdx.x (see the beginning of the integrateBodies function).

I ran into trouble with multithreadBodies=true during this sum when q > 4 (128 bodies, p = 16, q = 8, gridx = 8), on a GTX 680: some sums were not done over the whole blockDim.y...

I changed the code to avoid that. It works, but I don't really know why...

if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x;
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y;
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z;

    __syncthreads();

    for (int i = 0; i < blockDim.y; i++)
    {
        acc.x += SX_SUM(threadIdx.x,i).x;
        acc.y += SX_SUM(threadIdx.x,i).y;
        acc.z += SX_SUM(threadIdx.x,i).z;
    }
}

Another question: in the first loop:

for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++) 
{
}

I don't know why blockIdx.y is used, since grid.y = 1.

3) For faster code, I use asynchronous H2D and D2D memory copies (my code only uses the gravitation kernel).