I am writing a program that does matrix-vector multiplication and least-squares solving, all through the cuBLAS and CULA APIs. The program iterates many times, and in each step I must set every element of one particular row of a matrix to zero.
I tried copying the entire matrix (50*1000 or larger) to the CPU, setting one row to zero, and copying the matrix back, but that is too time-consuming because the program iterates 10 times or more. So I decided to write a kernel function instead.
The __global__ function looks like this:
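For reference, the round trip I tried first can be sketched roughly as below. This is a minimal sketch, not my exact code: the helper name zeroRowViaHost and the parameters rows, cols and row are made up for illustration, and it assumes the matrix is stored column-major with the number of rows as its leading dimension.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from my real program): zero one row of a
// column-major rows x cols matrix by copying it to the host and back.
cudaError_t zeroRowViaHost(float* dA, int rows, int cols, int row)
{
    std::vector<float> hA(static_cast<size_t>(rows) * cols);
    const size_t bytes = hA.size() * sizeof(float);

    // Device -> host: the whole matrix is transferred, not just one row.
    cudaError_t err = cudaMemcpy(hA.data(), dA, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) return err;

    // Column-major: element (row, c) lives at index c * rows + row.
    for (int c = 0; c < cols; ++c)
        hA[static_cast<size_t>(c) * rows + row] = 0.0f;

    // Host -> device: the whole matrix is transferred again.
    return cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
}

int main()
{
    const int rows = 1000, cols = 50, row = 3;   // assumed sizes
    std::vector<float> h(static_cast<size_t>(rows) * cols, 1.0f);
    const size_t bytes = h.size() * sizeof(float);

    float* dA = nullptr;
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dA, h.data(), bytes, cudaMemcpyHostToDevice);

    cudaError_t err = zeroRowViaHost(dA, rows, cols, row);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dA);
        return 1;
    }

    // Copy back once more only to inspect the result.
    cudaMemcpy(h.data(), dA, bytes, cudaMemcpyDeviceToHost);
    printf("A(%d,0) = %f, A(%d,0) = %f\n", row, h[row], row + 1, h[row + 1]);

    cudaFree(dA);
    return 0;
}
```

Every call moves the full matrix across the bus twice just to change one row, which is why I want to do the update on the GPU.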
__global__ void Setzero(float* A, int index) /* A is the matrix, stored in column-major order; index is the row I want to set to zero */
{
    int ind = blockDim.x * blockIdx.x + threadIdx.x;
    // Note: the matrix is column-major and holds 50000 elements in total
    if (((ind % N) == index) && (ind < 50000))
    {
        A[ind] = 0.0f;
    }
}
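For comparison, here is a self-contained sketch of the same idea with one thread per column instead of one thread per element. It is only a sketch under assumptions: the kernel name SetRowZero and the parameters lda, cols and row are hypothetical, and it assumes a column-major matrix whose leading dimension equals the number of rows (1000 here, which matches the lda I pass to cublasSgemv_v2). The kernel is defined above main, so it is already declared at the point where it is launched.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per column, each clearing A(row, col).
// A is column-major with leading dimension lda, so A(row, col) = A[col * lda + row].
__global__ void SetRowZero(float* A, int lda, int cols, int row)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (col < cols)
        A[col * lda + row] = 0.0f;
}

int main()
{
    const int rows = 1000, cols = 50, row = 7;   // assumed sizes
    std::vector<float> h(static_cast<size_t>(rows) * cols, 1.0f);
    const size_t bytes = h.size() * sizeof(float);

    float* dA = nullptr;
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dA, h.data(), bytes, cudaMemcpyHostToDevice);

    // Only `cols` threads are needed, rounded up to whole blocks.
    const int threads = 64;
    const int blocks = (cols + threads - 1) / threads;
    SetRowZero<<<blocks, threads>>>(dA, rows, cols, row);

    // Check both the launch itself and the kernel's execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dA);
        return 1;
    }

    // Verify on the host: the chosen row is 0, everything else is still 1.
    cudaMemcpy(h.data(), dA, bytes, cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r) {
            const float expected = (r == row) ? 0.0f : 1.0f;
            if (h[static_cast<size_t>(c) * rows + r] != expected) ++mismatches;
        }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dA);
    return mismatches != 0;
}
```

With this layout only 50 threads do any work, rather than 50000 threads of which all but 50 fail the modulo test.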
The problem appears when I do this (calling cuBLAS before launching the kernel):
cudaMalloc((void**)&A_Gpu_trans,sizeof(float)*50000);
cudaMemcpy(A_Gpu_trans,A_trans,sizeof(float)*M*N,cudaMemcpyHostToDevice);
cublasSgemv_v2(handle,CUBLAS_OP_N,1000,50,&al,A_Gpu_trans,1000,err_gpu,1,&beta,product,1);
dim3 dimBlock(16,1);
dim3 dimGrid((50000-1)/16+1,1);
Setzero<<<dimGrid,dimBlock>>>(A_Gpu_trans,Index);
The build fails with this error:
a __host__ function("Setzero") redeclared with __global__.
and another error:
MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "D:\Program Files\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "C:\Users\Administrator\documents\visual studio 2010\Projects\OOmp\OOmp\kernel.cu"" exited with code 2.
This is strange, because when I use only the cuBLAS and CULA APIs, without the kernel, the program builds and I get the right answer.