1
votes

I am a CUDA beginner. What I have here is a kernel that is executed by 2 threads. Each thread should add its result to a shared variable. After both threads finish, the result in sum should be 12, but I get 6!

__global__ void test(..)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    __shared__ double sum;

    ...

    if (i == 0)
        sum = 0.0;
    __syncthreads();

    if (i <= 1)
        sum += 2.0*3.0;
    __syncthreads();

    // sum should be 12 here, but I get 6. Why?
}

called by

test<<<1,2>>>(..);

1 Answer

9
votes

You have a memory race in your code. This:

sum += 2.0*3.0;

is a read-modify-write operation: each thread loads sum, adds 6.0, and stores the result back. Nothing prevents both threads from loading the old value (0.0) at the same time; both then store 6.0, and one thread's update is lost, which is why you see 6 instead of 12. Simultaneous non-atomic updates to the same address are undefined behaviour in CUDA.

The usual way to avoid this problem is algorithm redesign: don't have multiple threads writing to the same memory location. There is a very widely documented shared memory reduction technique you can use to accumulate sums from shared memory arrays without memory races.
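As a minimal sketch of that reduction idea applied to your example (assuming a block size that is a power of two, as with your test<<<1,2>>> launch; the kernel name and output parameter are illustrative):

```cuda
__global__ void test(double *out)
{
    // One slot per thread, so no two threads ever write the same address.
    extern __shared__ double partial[];
    int t = threadIdx.x;

    partial[t] = 2.0 * 3.0;   // each thread writes only its own slot
    __syncthreads();

    // Tree reduction: halve the number of active threads each step,
    // with a barrier between steps so reads see completed writes.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();
    }

    if (t == 0)
        *out = partial[0];    // for 2 threads this is 12.0
}
```

Launch it with the shared memory size as the third launch parameter, e.g. test<<<1,2,2*sizeof(double)>>>(d_out);.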

Alternatively, there are atomic memory access primitives which can be used to serialise memory access. Your example is double precision floating point, for which I am fairly sure there is no intrinsic atomic add function. The programming guide includes an example of a user-space atomic add for double precision. Depending on your hardware, it may or may not be usable on a shared memory variable, as 64 bit shared memory atomic operations are only supported on compute capability 2.x and 3.x devices. In any case, atomic memory operations should be used sparingly, because serialising memory access greatly reduces performance.
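For reference, this is the shape of the double-precision atomic add shown in the programming guide, built on the 64 bit atomicCAS intrinsic (the function name here is my own; subject to the shared memory atomics caveat above):

```cuda
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull =
        (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;

    // Retry loop: re-read and re-attempt the compare-and-swap until
    // no other thread has modified the value in between.
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);

    return __longlong_as_double(old);
}
```

In your kernel you would then write atomicAddDouble(&sum, 2.0*3.0); instead of sum += 2.0*3.0;.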