I am trying to run the following kernel, which is similar to a sparse matrix-vector multiplication (SpMV) kernel:
__global__ void cost_spmv(const int *population, const int *row, const int *col, int *out) {
    /* Calculate the cost vector for multiplication of the matrices */
    //int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int lane = threadIdx.x;
    extern __shared__ int val[];
    int r = blockIdx.x;
    int rowStart = row[r];
    int rowEnd = row[r + 1];
    val[threadIdx.x] = 0;
    for (int i = rowStart + lane; i < rowEnd; i += 32)
        val[threadIdx.x] += population[col[i]];
    __syncthreads();
    if (lane < 16)
        val[threadIdx.x] += val[threadIdx.x + 16];
    if (lane < 8)
        val[threadIdx.x] += val[threadIdx.x + 8];
    if (lane < 4)
        val[threadIdx.x] += val[threadIdx.x + 4];
    if (lane < 2)
        val[threadIdx.x] += val[threadIdx.x + 2];
    if (lane < 1)
        val[threadIdx.x] += val[threadIdx.x + 1];
    if (lane == 0)
        out[r] = val[threadIdx.x];
}
The kernel is invoked using

    cost_spmv<<<numRows, 32, 32 * sizeof(int)>>>(population, rowArray, colArray, out);

where numRows is the size of the arrays population and out; rowArray has numRows+1 elements. rowArray[i] contains the starting index of the elements belonging to row i, and the size of colArray is rowArray[numRows]. colArray holds the column numbers of the nonzero entries, stored row by row as delimited by rowArray.
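To make the layout concrete, here is a minimal host-side sketch. It assumes a made-up 3-row sparsity pattern and made-up population values; the helper name cost_reference is hypothetical and not part of the original code. It computes on the CPU what the kernel is meant to produce, namely the sum of population[c] over every column c stored for a row, which gives a reference to compare GPU results against.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>

// CPU reference (hypothetical helper): out[r] is the sum of population[c]
// over every column index c stored for row r.
std::vector<int> cost_reference(const std::vector<int>& population,
                                const std::vector<int>& rowArray,
                                const std::vector<int>& colArray) {
    const int numRows = static_cast<int>(rowArray.size()) - 1;
    // The last entry of rowArray equals the number of stored column indices.
    assert(rowArray.back() == static_cast<int>(colArray.size()));
    std::vector<int> out(numRows, 0);
    for (int r = 0; r < numRows; ++r)
        for (int i = rowArray[r]; i < rowArray[r + 1]; ++i)
            out[r] += population[colArray[i]];
    return out;
}

int main() {
    // Illustrative 3x3 pattern (x marks a nonzero):
    //   row 0: x . x
    //   row 1: . x .
    //   row 2: x x x
    std::vector<int> rowArray = {0, 2, 3, 6};           // numRows + 1 entries
    std::vector<int> colArray = {0, 2, 1, 0, 1, 2};     // rowArray[numRows] entries
    std::vector<int> population = {5, 7, 11};           // one value per column

    std::vector<int> out = cost_reference(population, rowArray, colArray);
    for (int v : out)
        std::printf("%d\n", v);  // prints 16, 7, 23 (one per line)
    return 0;
}
```

This is plain host code, so it should build with nvcc or any C++11 compiler.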
When I compile the kernel for compute capability 3.5 and run it on a Tesla P4, I get a different answer than when I compile it for compute capability 6.1. The answer I get with compute capability 6.1 on the Tesla P4 matches what I get with compute capability 3.5 on a 920M. What could be the reason for this?
Try declaring val as volatile: extern __shared__ volatile int val[];
However there may certainly be other defects as well. – Robert Crovella
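A minimal sketch of what that suggestion could look like in a complete program is below. Without volatile, the compiler is allowed to keep val[] values in registers between the reduction steps, so one thread may never see what another thread wrote to shared memory. The __syncwarp() calls (CUDA 9 or later) are an extra assumption on top of the comment, added so the steps also stay ordered on GPUs that schedule threads of a warp independently. The names cost_spmv_volatile and run_cost_spmv and the sample data are hypothetical, and this is not claimed to be the only defect or the definitive fix.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant of the kernel with val declared volatile, as the comment suggests.
// Assumes exactly 32 threads (one warp) per block and one block per row.
__global__ void cost_spmv_volatile(const int *population, const int *row,
                                   const int *col, int *out) {
    extern __shared__ volatile int val[];
    const int lane = threadIdx.x;
    const int r = blockIdx.x;

    // Each lane accumulates a strided partial sum for row r.
    int sum = 0;
    for (int i = row[r] + lane; i < row[r + 1]; i += 32)
        sum += population[col[i]];
    val[lane] = sum;
    __syncthreads();

    // Tree reduction with offsets 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            val[lane] = val[lane] + val[lane + offset];
        __syncwarp();  // assumption: order the steps explicitly (CUDA 9+)
    }
    if (lane == 0)
        out[r] = val[0];
}

// Hypothetical host wrapper: copies inputs, launches one warp per row,
// and returns the result. Error handling is kept minimal for brevity.
std::vector<int> run_cost_spmv(const std::vector<int>& population,
                               const std::vector<int>& rowArray,
                               const std::vector<int>& colArray) {
    const int numRows = static_cast<int>(rowArray.size()) - 1;
    int *dPop, *dRow, *dCol, *dOut;
    cudaMalloc(&dPop, population.size() * sizeof(int));
    cudaMalloc(&dRow, rowArray.size() * sizeof(int));
    cudaMalloc(&dCol, colArray.size() * sizeof(int));
    cudaMalloc(&dOut, numRows * sizeof(int));
    cudaMemcpy(dPop, population.data(), population.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dRow, rowArray.data(), rowArray.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, colArray.data(), colArray.size() * sizeof(int), cudaMemcpyHostToDevice);

    cost_spmv_volatile<<<numRows, 32, 32 * sizeof(int)>>>(dPop, dRow, dCol, dOut);

    std::vector<int> out(numRows);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(out.data(), dOut, numRows * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;  // silence unused-variable warnings when NDEBUG is defined

    cudaFree(dPop); cudaFree(dRow); cudaFree(dCol); cudaFree(dOut);
    return out;
}

int main() {
    // Made-up 3-row example: row 0 has columns {0, 2}, row 1 has {1}, row 2 has {0, 1, 2}.
    std::vector<int> out = run_cost_spmv({5, 7, 11}, {0, 2, 3, 6}, {0, 2, 1, 0, 1, 2});
    for (int v : out)
        std::printf("%d\n", v);  // a correct run prints 16, 7, 23 (one per line)
    return 0;
}
```

Building the same file for each target, for example with nvcc -arch=sm_35 and nvcc -arch=sm_61, and comparing the output against a CPU reference is one way to check whether the discrepancy goes away.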