Sorry for my English. I have a CUDA kernel that returns different results from run to run. The kernel computes the sum of a series and consists of four parts. Let me explain briefly how it works. The first part distributes the iterations among the threads (I used this as a source). The second part shows how every thread computes its partial sum. After the second part we must place __syncthreads(), because after the second part we start to use shared memory. In the third part I compute the sum over all threads in the block and put it in the thread whose threadIdx.x equals 0 (I used this as a source, page 22). In the fourth part I compute the sum over all thread blocks and put it in dSum[0].
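To make the first part concrete, here is a minimal host-side C++ sketch of the same split. It mirrors the arithmetic of part 1 in the kernel but is not part of the original code: the names `IterRange`, `iterationRange`, `coversExactly`, and `kKernelIterations` are hypothetical, and the total iteration count is passed as a parameter `n` (the kernel uses INT_MAX) so the split can be checked on small inputs.

```cpp
#include <cassert>
#include <climits>

// The iteration count the kernel distributes (assumed from part 1).
const int kKernelIterations = INT_MAX;

// Hypothetical helper type: half-open iteration range [start, end).
struct IterRange {
    int start;
    int end;
};

// Mirrors part 1 of the kernel; n stands in for INT_MAX.
// The first (n % total) threads receive one extra iteration.
IterRange iterationRange(int n, int total, int tid) {
    assert(total > 0 && tid >= 0 && tid < total);
    int q = n / total;
    int r = n % total;
    IterRange range;
    if (r > tid) {
        range.start = q * tid + tid;
        range.end = range.start + q + 1;
    } else {
        range.start = q * tid + r;
        range.end = range.start + q;
    }
    return range;
}

// Returns true when the ranges of all threads tile [0, n)
// with no gaps and no overlaps.
bool coversExactly(int n, int total) {
    int expectedStart = 0;
    for (int tid = 0; tid < total; ++tid) {
        IterRange range = iterationRange(n, total, tid);
        if (range.start != expectedStart || range.end < range.start) {
            return false;
        }
        expectedStart = range.end;
    }
    return expectedStart == n;
}
```

For example, `iterationRange(10, 4, 2)` gives the range [6, 8), and `coversExactly(kKernelIterations, 64 * 768)` checks the split for a launch of 64 blocks with 768 threads each.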
Did I place __syncthreads() correctly? Where is the error? Why does it give a wrong result with 64 blocks and 768 threads, but a correct result with 768 blocks and 64 threads?
__global__ void sumSeries(double* dSum,int totalThreadNumber){
    volatile __shared__ double data[768];
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    int myend;
    double var;
    //part_1 get tid's start iteration value and end iteration value.
    int mystart = (INT_MAX / totalThreadNumber) * tid;
    if (INT_MAX % totalThreadNumber > tid)
    {
        mystart += tid;
        myend = mystart + (INT_MAX / totalThreadNumber) + 1;
    }
    else
    {
        mystart += INT_MAX % totalThreadNumber;
        myend = mystart + (INT_MAX / totalThreadNumber);
    }
    //part_2 get halfsum
    data[threadIdx.x]=0;
    for (int i = mystart ; i < myend ; ++i){
        var=i;
        data[threadIdx.x] += (var*var+var+1)/(var*var*var+var*var+var+1);
    }
    __syncthreads();
    //part_3 sum all results in every block
    for (int s=blockDim.x/2; s>32; s>>=1)
    {
        if (threadIdx.x < s)
            data[threadIdx.x] += data[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x < 32)
    {
        data[threadIdx.x] += data[threadIdx.x + 32];
        data[threadIdx.x] += data[threadIdx.x + 16];
        data[threadIdx.x] += data[threadIdx.x + 8];
        data[threadIdx.x] += data[threadIdx.x + 4];
        data[threadIdx.x] += data[threadIdx.x + 2];
        data[threadIdx.x] += data[threadIdx.x + 1];
    }
    if (threadIdx.x==0)
    {
        dSum[blockIdx.x]=data[0];
    }
    __syncthreads();
    //part_4
    if (tid==0)
        for (int t=1;t<8;++t)
            dSum[0]=dSum[0]+dSum[t];
}
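One way to look at part 3 in isolation is a sequential host-side emulation of the in-block reduction. The sketch below is a simplified model under stated assumptions, not the kernel itself: `emulateBlockReduce` is a hypothetical name, every thread's partial sum is set to 1.0 so that the correct total equals the block size, and threads are visited in ascending threadIdx.x order, which matches lock-step execution for this access pattern but says nothing about real warp scheduling.

```cpp
#include <cassert>
#include <vector>

// Sequential model of part 3 for one block of blockDimX threads.
// Returns the value that would end up in data[0].
double emulateBlockReduce(int blockDimX) {
    const int kSharedSize = 768;  // same size as data[] in the kernel
    // Below 64 threads the kernel would read uninitialized shared memory,
    // which this model does not try to represent.
    assert(blockDimX >= 64 && blockDimX <= kSharedSize);

    // Assumption: each thread contributes a partial sum of exactly 1.0.
    std::vector<double> data(kSharedSize, 0.0);
    for (int t = 0; t < blockDimX; ++t) {
        data[t] = 1.0;
    }

    // Tree reduction loop, with the same bounds as in the kernel.
    for (int s = blockDimX / 2; s > 32; s >>= 1) {
        for (int t = 0; t < s; ++t) {
            data[t] += data[t + s];
        }
    }

    // Unrolled final steps executed by threads with threadIdx.x < 32.
    const int offsets[] = {32, 16, 8, 4, 2, 1};
    for (int offset : offsets) {
        for (int t = 0; t < 32; ++t) {
            data[t] += data[t + offset];
        }
    }
    return data[0];
}
```

In this model a 64-thread block returns 64 and a 512-thread block returns 512, but a 768-thread block returns 896: the loop stops after s = 48, and the `+ 32` step then adds data[48] through data[63] a second time. That would be consistent with the reduction assuming a power-of-two blockDim.x, although the model does not cover how part 4 combines results across blocks.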
cuda-memcheck? What sort of GPU are you running on, and what is your nvcc compile command line? – Robert Crovella