
I have a CUDA kernel which returns different result values from run to run. The kernel computes a series sum and consists of 4 code parts. Let me explain a little how it works. The first part distributes the iterations between the threads (I took it as source). The second part shows how every thread computes its partial sum. After the second part we must place __syncthreads(), because after it we start using shared memory. In the third part I collect the resulting sum of all threads in a block and put it into the thread whose threadIdx.x equals 0 (I took it as source @ page 22). In the fourth part I collect the resulting sum of all thread blocks and put it into dSum[0].

Did I place __syncthreads() correctly? Where is the error? Why does it give a wrong result with 64 blocks of 768 threads, but a correct result with 768 blocks of 64 threads?

__global__ void sumSeries(double* dSum, int totalThreadNumber){
    volatile __shared__ double data[768];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int myend;
    double var;
    //part_1: get tid's start iteration value and end iteration value.
    int mystart = (INT_MAX / totalThreadNumber) * tid;
    if (INT_MAX % totalThreadNumber > tid)
    {
        mystart += tid;
        myend = mystart + (INT_MAX / totalThreadNumber) + 1;
    }
    else
    {
        mystart += INT_MAX % totalThreadNumber;
        myend = mystart + (INT_MAX / totalThreadNumber);
    }
    //part_2: compute this thread's partial sum
    data[threadIdx.x] = 0;
    for (int i = mystart; i < myend; ++i){
        var = i;
        data[threadIdx.x] += (var*var + var + 1) / (var*var*var + var*var + var + 1);
    }
    __syncthreads();

    //part_3: sum all results within each block
    for (int s = blockDim.x / 2; s > 32; s >>= 1)
    {
        if (threadIdx.x < s)
            data[threadIdx.x] += data[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x < 32)
    {
        data[threadIdx.x] += data[threadIdx.x + 32];
        data[threadIdx.x] += data[threadIdx.x + 16];
        data[threadIdx.x] += data[threadIdx.x + 8];
        data[threadIdx.x] += data[threadIdx.x + 4];
        data[threadIdx.x] += data[threadIdx.x + 2];
        data[threadIdx.x] += data[threadIdx.x + 1];
    }

    if (threadIdx.x == 0)
    {
        dSum[blockIdx.x] = data[0];
    }
    __syncthreads();
    //part_4: sum the per-block results into dSum[0]
    if (tid == 0)
        for (int t = 1; t < 8; ++t)
            dSum[0] = dSum[0] + dSum[t];
}
Are you doing proper cuda error checking? What happens when you run your code with cuda-memcheck? What sort of GPU are you running on, and what is your nvcc compile command line? – Robert Crovella

1 Answer


So your sum is the series over

(n^2+n+1)/(n^3+n^2+n+1) = (n^3-1)/(n^4-1)

This has the same convergence behavior as the harmonic series over

1/n

namely none: it diverges, very very slowly, towards infinity. The sum from 1 to N has a value between log(N) and 1-log(2)+log(N+1).

The result of any finite summation of this series is very sensitive to the order of summation. Summing forward from 1 to N, with the terms decreasing, suppresses all terms where 1 == 1+1/n, which happens at a rather small n for floats. Summing backwards from some N down to 1 accumulates the small numbers first and preserves their cumulative contribution.

So depending on the order of arrival of the partial sums, especially when the sum containing 1 comes in, the total sum will show noticeable differences.


Since x^3+x^2+x+1 = (x+1)(x^2+1), the summand has the partial fraction decomposition

f(x) = (x^2+x+1)/(x^3+x^2+x+1) = 0.5/(x+1)+0.5*(x+1)/(x^2+1)

and both terms are monotonically decreasing for x >= 1.

An anti-derivative of this function is

F(x) = 0.5*ln(x+1)+0.25*ln(x^2+1)+0.5*arctan(x)

so that

f(n+1) <= F(n+1)-F(n) <= f(n) <= F(n)-F(n-1)

summing this up results in

F(N+1)-F(m) <= sum(n=m to N) f(n) <= F(N)-F(m-1)

To obtain bounds for the full sum, one has to add the initial part of the sum to all three of these terms.

So set m=1000, compute S = sum(n=0 to 999) f(n); then

S+F(2^32  )-F(1000) = 23.459829390459243
S+F(2^32-1)-F( 999) = 23.460829890558995

are the lower and upper bounds for a summation from 0 to 2^32-1, far away from any of the numerical results.