3
votes

When I read the n-body code in the CUDA SDK, I went through some of the lines and found that it differs slightly from the paper in GPU Gems 3, "Fast N-Body Simulation with CUDA".

My questions are: First, why is blockIdx.x still involved in loading from global memory into shared memory, as in the following code?

for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
    sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : //this line
        positions[WRAP(blockIdx.x + tile,                   gridDim.x) * p + threadIdx.x];  //this line

    __syncthreads();

    // This is the "tile_calculation" function from the GPUG3 article.
    acc = gravitation(bodyPos, acc);

    __syncthreads();
}

Isn't it supposed to look like the following, according to the paper? I wonder why it doesn't.

    sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
        positions[WRAP(tile,                   gridDim.x) * p + threadIdx.x];

Second, in the multiple-threads-per-body case, why is threadIdx.x still involved? Isn't it supposed to be a fixed value, or not involved at all, since the sum is only over threadIdx.y?

if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; //this line
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; //this line
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; //this line

    __syncthreads();

    // Save the result in global memory for the integration step
    if (threadIdx.y == 0)
    {
        for (int i = 1; i < blockDim.y; i++)
        {
            acc.x += SX_SUM(threadIdx.x,i).x; //this line
            acc.y += SX_SUM(threadIdx.x,i).y; //this line
            acc.z += SX_SUM(threadIdx.x,i).z; //this line
        }
    }
}

Can anyone explain this to me? Is it some kind of optimization for faster code?


2 Answers

5
votes

I am an author of this code and the paper. Numbered answers correspond to your numbered questions.

  1. The blockIdx.x offset to the WRAP macro is not mentioned in the paper because this is a micro-optimization. I'm not even sure it is worthwhile any more. The purpose was to ensure that different SMs were accessing different DRAM memory banks rather than all pounding on the same bank at the same time, to ensure we maximize the memory throughput during these loads. Without the blockIdx.x offset, all simultaneously running thread blocks will access the same address at the same time. Since the overall algorithm is compute rather than bandwidth bound, this is definitely a minor optimization. Sadly, it makes the code more confusing.
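The effect of that offset can be sketched with a host-side simulation (plain Python, an illustration only, not the SDK code) of the tile index each thread block reads on a given loop iteration:

```python
def wrap(x, n):
    """Same idea as the SDK's WRAP macro: fold x into [0, n) for 0 <= x < 2n."""
    return x - n if x >= n else x

def tile_read(block_idx_x, tile, grid_dim_x):
    """Row of 'positions' a block loads: WRAP(blockIdx.x + tile, gridDim.x)."""
    return wrap(block_idx_x + tile, grid_dim_x)

grid_dim_x = 4  # number of tiles == number of blocks in x (small, for illustration)

for it in range(grid_dim_x):
    tiles = [tile_read(b, it, grid_dim_x) for b in range(grid_dim_x)]
    # With the blockIdx.x offset, concurrently running blocks read a
    # permutation of the tiles on each iteration, instead of all blocks
    # reading tile 'it' (and thus the same DRAM addresses) at once.
    assert sorted(tiles) == list(range(grid_dim_x))
    print(f"iteration {it}: block -> tile {tiles}")
```

Every block still visits every tile over the whole loop; only the order is rotated per block, which is why the result is unchanged.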

  2. The sum is across threadIdx.y, as you say, but each thread needs to do a separate sum (each thread computes gravitation for a separate body). Therefore we need to use threadIdx.x to index the right column of the (conceptually 2D) shared memory array.
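A small host-side sketch (Python, an illustration rather than the SDK code) of why threadIdx.x must index the shared array: each x column holds the partials for a different body, and the reduction in the kernel sums only its own column:

```python
# Simulated block of p x-threads (one body each) and q y-threads
# (each y-thread holds one partial acceleration for its body).
p, q = 4, 3

# sx_sum[x][y]: partial computed by thread (x, y); stand-in values here.
sx_sum = [[(x + 1) * 10 + y for y in range(q)] for x in range(p)]

# The thread with threadIdx.y == 0 in column x starts from its own
# partial (i = 0 is already in acc) and adds partials i = 1 .. q-1
# of the SAME column, mirroring the loop over blockDim.y in the kernel:
totals = []
for x in range(p):          # one body per threadIdx.x
    acc = sx_sum[x][0]
    for i in range(1, q):
        acc += sx_sum[x][i]
    totals.append(acc)

# Each column reduces independently; dropping threadIdx.x would make
# every body sum the same (wrong) column.
assert totals == [sum(col) for col in sx_sum]
print(totals)  # -> [33, 63, 93, 123]
```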

To answer SystmD's question in his (not really correct) answer: gridDim.y is only 1 in the (default/common) 1D grid case.

0
votes

1) The array sharedPos is loaded into the shared memory of each block (i.e., each tile) before the threads of each block synchronize (with __syncthreads()). blockIdx.x is the index of the tile, according to the algorithm.

Each thread (with indices threadIdx.x, threadIdx.y) loads one part of the shared array sharedPos; blockIdx.x refers to the index of the tile (without multithreading).

2) acc is the float3 acceleration of the body with index blockIdx.x * blockDim.x + threadIdx.x (see the beginning of the integrateBodies function).

I ran into trouble with multithreadBodies=true during this sum when q > 4 (128 bodies, p = 16, q = 8, gridx = 8), on a GTX 680: some sums were not done over the whole blockDim.y...

I changed the code to avoid that. It works, but I don't really know why...

if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x;
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y;
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z;

    __syncthreads();

    for (int i = 0; i < blockDim.y; i++)
    {
        acc.x += SX_SUM(threadIdx.x,i).x;
        acc.y += SX_SUM(threadIdx.x,i).y;
        acc.z += SX_SUM(threadIdx.x,i).z;
    }
}

Another question: in the first loop:

for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++) 
{
}

I don't know why blockIdx.y is used, since grid.y = 1.

3) For faster code, I use asynchronous H2D and D2D memory copies (my code only uses the gravitation kernel).