Cuda output varying with compute capability

Question

I am trying to run the following kernel which is similar to sparse matrix vector multiplication(SpMV) kernel.

__global__ void cost_spmv(const int *population,const int *row,const int *col,int *out){
    /*Calculate the cost vector for multiplication of the matrices*/
    //int tid=threadIdx.x+blockDim.x*blockIdx.x;
    int lane=threadIdx.x;
    extern __shared__ int val[];
    int r=blockIdx.x;
    int rowStart=row[r];
    int rowEnd=row[r+1];
    val[threadIdx.x]=0;
    for(int i=rowStart+lane;i<rowEnd;i+=32)
        val[threadIdx.x]+=  population[col[i]];
    __syncthreads();
    if(lane<16)
        val[threadIdx.x]+=val[threadIdx.x+16];
    if(lane<8)
        val[threadIdx.x]+=val[threadIdx.x+8];
    if(lane<4)
        val[threadIdx.x]+=val[threadIdx.x+4];
    if(lane<2)
        val[threadIdx.x]+=val[threadIdx.x+2];
    if(lane<1)
        val[threadIdx.x]+=val[threadIdx.x+1];
    if(lane==0)
        out[r]=val[threadIdx.x];
}

The kernel is invoked using

cost_spmv<<<numRows,32,32*sizeof(int)>>>(population,rowArray,colArray, out)

Where numRows is the size of the arrays population,out and rowArray(numRows+1 actually). rowArray[i] contains the starting index of the elements belonging to row i.The size of colArray is rowArray[numRows]. colArray[i] contains the column numbers which have non zero value for the row described using the rowArray.

However on compiling it for compute capability 3.5 on Tesla P4 I get different answer compared to what I get for compute capability 6.1. Also the answer I get using compute capability 6.1 on Tesla P4 is the same as what I get using compute capability 3.5 on 920m. What could be the reason for it?

You will need to provide a proper minimal reproducible example if you want what is effectively debugging help — talonmies
The given code attempts to use warp-synchronous behavior in a defective fashion. Try marking val as volatile: extern __shared__ volatile int val[]; However there may certainly be other defects as well. — Robert Crovella
It seems to work correctly now after marking it volatile and the answer is same as the one I get using arch=sm_61 for sm_35 on Tesla P4 however I don't understand why was the answer same for p4 on sm_61 and 920m on sm_35 but different for P4 on sm_35. — CHAITANYA BHATIA
The code as written invokes unspecified behavior. On occasion, unspecified behavior can happen to match what the programmer intended to happen. — njuffa

njuffa njuffa · Accepted Answer · 2018-02-12T08:25:43

Keep in mind that the CUDA compiler has a single-thread view of the world. It knows nothing of the run-time configuration used to execute the code, which is not available at compile time.

There are no dependencies expressed in the code between the loads of val[] and previous writes to val[]. Therefore the compiler is free to move the loads as it sees fit. In some cases it may chose to issue some or all loads early to increase the load-latency tolerance of the code, e.g. by transforming the code as follows:

int __temp0 = val[threadIdx.x+16];
int __temp1 = val[threadIdx.x+ 8];
int __temp2 = val[threadIdx.x+ 4];
int __temp3 = val[threadIdx.x+ 2];
int __temp4 = val[threadIdx.x+ 1];

if(lane<16)
    val[threadIdx.x]+=__temp0;
if(lane<8)
    val[threadIdx.x]+=__temp1;
if(lane<4)
    val[threadIdx.x]+=__temp2;
if(lane<2)
    val[threadIdx.x]+=__temp3;
if(lane<1)
    val[threadIdx.x]+=__temp4;

Depending on where the compiler chooses to place the loads, the results of the reduction sequence will differ. Code generation, and instruction scheduling in particular, in the CUDA compiler differs by GPU architecture, so different results may be observed when compiling for different GPU architectures.

To enforce the desired dependencies between loads and stores, the method sanctioned by the CUDA programming model is to use __syncthreads() after every reduction step to create a barrier. The potentially faster, but hacky way to achieve the desired outcome is to declare val alterable by agents outside the scope of the code through use of the volatile modifier. This prevents the compiler from moving around the loads from val[].

Cuda output varying with compute capability

1 Answers