I am using a Tesla C1060 (compute capability 1.3) with nvcc 4.0. I am trying to do some computation local to each thread block. Each thread block has a shared array that is first initialized to zero. To synchronize concurrent updates (additions) to the shared data by the threads of a block, I use the CUDA atomicAdd primitive.
Once each thread block has its results in the shared array, each entry of that array is merged (using atomicAdd) into the corresponding entry of a global data array.
Below is code very similar to what I am trying to do.
#define DATA_SZ 16
typedef unsigned long long int ULLInt;
__global__ void kernel( ULLInt* data, ULLInt ThreadCount )
{
    ULLInt thid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ ULLInt sharedData[DATA_SZ];
    // Initialize the shared data
    if( threadIdx.x == 0 )
    {
        for( int i = 0; i < DATA_SZ; i++ ) { sharedData[i] = 0; }
    }
    __syncthreads();
    //..some code here
    if( thid < ThreadCount )
    {
        //..some code here
        atomicAdd( &sharedData[getIndex(thid)], thid );
        //..some code here
        for(..a loop...)
        {
            //..some code here
            if(thid % 2 == 0)
            {
                // getIndex() returns a value in [0, DATA_SZ )
                atomicAdd( &sharedData[getIndex(thid)], thid * thid );
            }
        }
    }
    __syncthreads();
    // Merge this block's shared results into the global array
    if( threadIdx.x == 0 )
    {
        // ...
        for( int i = 0; i < DATA_SZ; i++ ) { atomicAdd( &data[i], sharedData[i] ); }
        //...
    }
}
If I compile with -arch=sm_20, I don't get any errors. However, when I compile the kernel with the -arch=sm_13 option, I get the following errors:
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error : Global state space expected for instruction 'atom'
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error : Global state space expected for instruction 'atom'
ptxas fatal : Ptx assembly aborted due to errors
If I comment out the following two lines, I don't get any errors with -arch=sm_13:
atomicAdd( &sharedData[getIndex(thid)], thid );
atomicAdd( &sharedData[getIndex(thid)], thid * thid );
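Both of those lines are 64-bit atomicAdd calls on the shared array, whereas the remaining atomicAdd targets global memory. For comparison, here is a minimal sketch of the same two-stage pattern with 32-bit counters in shared memory. It assumes (this is my reading of the compute-capability table in the CUDA C Programming Guide, not something I have confirmed on the C1060) that 1.x devices accept 64-bit atomics only on global memory. The names blockAccumulate, runAccumulate and BINS, and the thid % BINS index, are hypothetical stand-ins for my real kernel and getIndex().

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BINS 16  // hypothetical stand-in for DATA_SZ

typedef unsigned long long int ULLInt;

// Stage 1 uses 32-bit atomics on shared memory;
// stage 2 uses 64-bit atomics on global memory.
__global__ void blockAccumulate( ULLInt* data, unsigned int threadCount )
{
    unsigned int thid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ unsigned int sharedData[BINS];

    // Zero the per-block accumulators cooperatively.
    for( int i = threadIdx.x; i < BINS; i += blockDim.x ) { sharedData[i] = 0; }
    __syncthreads();

    if( thid < threadCount )
    {
        // Hypothetical index function standing in for getIndex():
        // each thread adds 1 to the bin selected by thid % BINS.
        atomicAdd( &sharedData[thid % BINS], 1u );
    }
    __syncthreads();

    // Merge each per-block bin into the 64-bit global total.
    for( int i = threadIdx.x; i < BINS; i += blockDim.x )
    {
        atomicAdd( &data[i], (ULLInt)sharedData[i] );
    }
}

// Host helper: launches the kernel and returns the BINS totals,
// or an empty vector if the device allocation fails.
std::vector<ULLInt> runAccumulate( unsigned int threadCount )
{
    std::vector<ULLInt> host( BINS, 0 );
    ULLInt* dev = 0;
    if( cudaMalloc( (void**)&dev, BINS * sizeof(ULLInt) ) != cudaSuccess )
    {
        return std::vector<ULLInt>();
    }
    cudaMemset( dev, 0, BINS * sizeof(ULLInt) );

    const unsigned int threads = 128;
    const unsigned int blocks = ( threadCount + threads - 1 ) / threads;
    blockAccumulate<<<blocks, threads>>>( dev, threadCount );

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy( &host[0], dev, BINS * sizeof(ULLInt), cudaMemcpyDeviceToHost );
    cudaFree( dev );
    return host;
}
```

The catch is that a 32-bit shared counter can overflow if one block's contribution to a bin exceeds 2^32 - 1, which may well happen with terms like thid * thid, so this variant only fits my case if the per-block sums stay small.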
Can someone suggest what I might be doing wrong?