I am trying to implement Gaussian elimination with CUDA. I have an N x N matrix, and to compute the new elements of this matrix I use the CPU code below, where C.width = N:
for (int z = 0; z < C.width - 1; z++)
{
    for (int c = z + 1; c < C.width; c++)
    {
        for (int d = z; d < C.width; d++)
        {
            C.elements[c*C.width+d] = C.elements[c*C.width+d]
                - (B.elements[c*C.width+z] * C.elements[z*C.width+d]);
        }
    }
}
I am trying to implement it in CUDA. For example, for N=512:
dim3 dimBlock(16,16,1);
dim3 dimGrid(32,32,1);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
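The computation of idx and idy is not shown in the kernel snippet below, so to be clear, I assume the usual mapping from block and thread indices (the kernel signature here is my guess):

```cuda
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // With dimBlock(16,16) and dimGrid(32,32) this covers 512 x 512 threads,
    // one per matrix element, for N = 512.
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // column index
    int idy = blockIdx.y * blockDim.y + threadIdx.y; // row index (row idy+1 is updated)
    ...
}
```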
I think for every iteration i I should use (N-i)*N threads to update the elements, that is:
if (idx > 511 || idy > 510)
    return;
for (int i = 1; i < 512; i++)
{
    if (idx >= i - 1 && idy >= i - 1)
        C.elements[(idy+1)*C.width+idx] = C.elements[(idy+1)*C.width+idx]
            - ((C.elements[(idy+1)*C.width+(i-1)] / C.elements[(i-1)*C.width+(i-1)])
                * C.elements[(i-1)*C.width+idx]);
    __syncthreads();
}
}
The results obtained on the GPU and CPU are the same, but the processing time is only about Time(CPU) = 2 * Time(GPU):

For N=512:  Time(CPU) = 1900 ms;  Time(GPU) = 980 ms
For N=1024: Time(CPU) = 14000 ms; Time(GPU) = 7766 ms
...
I think the speed-up should be larger than what I am getting now. Is there any mistake in my parallel code? Can you help me rewrite it?
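One restructuring I have in mind is the following untested sketch (the names ComputeMultipliers and EliminateStep are mine): launch one small kernel per pivot step, so the global synchronization comes from the kernel-launch boundary instead of __syncthreads() (which only synchronizes threads within a single block), and only the elements that actually change in step z get threads:

```cuda
// Untested sketch. Multipliers are written to a separate array m first, so
// no thread reads a value in column z that another thread already updated.
__global__ void ComputeMultipliers(const float *C, float *m, int width, int z)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x + z + 1; // rows below pivot
    if (c < width)
        m[c] = C[c * width + z] / C[z * width + z];
}

__global__ void EliminateStep(float *C, const float *m, int width, int z)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x + z;     // columns z..width-1
    int c = blockIdx.y * blockDim.y + threadIdx.y + z + 1; // rows z+1..width-1
    if (c < width && d < width)
        C[c * width + d] -= m[c] * C[z * width + d];
}

// Host side: kernel launches on the same stream execute in order, which
// replaces the grid-wide synchronization the single-kernel version needs.
void eliminate(float *d_C, float *d_m, int width)
{
    dim3 block(16, 16);
    for (int z = 0; z < width - 1; z++) {
        int rows = width - z - 1, cols = width - z;
        ComputeMultipliers<<<(rows + 255) / 256, 256>>>(d_C, d_m, width, z);
        dim3 grid((cols + 15) / 16, (rows + 15) / 16);
        EliminateStep<<<grid, block>>>(d_C, d_m, width, z);
    }
}
```

I have not measured this variant; it is only the direction I am considering.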
Thanks for any help!