I have successfully programmed a matrix-matrix multiplication on a single node, and my aim now is to make that program run in parallel across cluster nodes.
The main work is a modification of the Netlib ScaLAPACK source: I replace the part that computes the local matrix-matrix multiplication (in this case the call to dgemm_) with my own routine (mydgemm).
Here the original code is a C program, but every computational routine it calls is a Fortran routine (dgemm_, for example, is Fortran), while my routine (mydgemm) is written in C.
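As far as I understand, this is how dgemm_ looks when it is declared from C (every argument is passed by pointer and the matrices are stored column-major), and mydgemm keeps the same argument list so it can be called where ScaLAPACK calls dgemm_. This is my sketch of the interface, not a prototype copied from a header:

/* Fortran-style interface that the replacement has to match (sketch):
   all scalars passed by pointer, A/B/C stored column-major. */
void dgemm_(char *transa, char *transb, int *m, int *n, int *k,
            double *alpha, double *a, int *lda,
            double *b, int *ldb,
            double *beta, double *c, int *ldc);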
After the modification I can run successfully on a single node with any matrix size, but when I run on 4 nodes with a matrix size larger than 200, the job aborts with the MPI error below (exit code 11, which usually indicates a segmentation fault).
This is the error:
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
PID 69754 RUNNING AT localhost.localdomain
EXIT CODE: 11
CLEANING UP REMAINING PROCESSES
YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
In the main function (attached below) I only use MPI to create the random matrices on each node and then call the routine new_pdgemm(...); the code I modified is inside new_pdgemm.
Inside mydgemm.c I use OpenMP to parallelize, and this code runs as the local kernel.
Could you give me a guide or an idea to solve my problem?
Do you think the problem is that Fortran is column-major while C is row-major?
Or do I need to rewrite mydgemm.c as mydgemm.f (that would be really hard and I am not sure I can do it)?
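For context, this is roughly what I mean by the kernel. It is not my real mydgemm, just a minimal sketch that keeps the dgemm_-style argument list from above, treats the arrays as column-major with a[i + j*lda] indexing, and parallelizes with OpenMP; only the "N","N" case is shown:

/* mydgemm (minimal sketch, not my optimized kernel): same calling convention as
   dgemm_, arrays indexed column-major so nothing is transposed on the C side. */
#include <omp.h>

void mydgemm(char *transa, char *transb, int *m, int *n, int *k,
             double *alpha, double *a, int *lda,
             double *b, int *ldb,
             double *beta, double *c, int *ldc)
{
    int i, j, l;
    (void) transa; (void) transb;            /* only the "N","N" case is handled here */
    #pragma omp parallel for private(i, l)
    for (j = 0; j < *n; j++) {               /* loop over columns of C */
        for (i = 0; i < *m; i++) {           /* loop over rows of C */
            double sum = 0.0;
            for (l = 0; l < *k; l++)
                sum += a[i + l * (*lda)] * b[l + j * (*ldb)];   /* column-major A and B */
            c[i + j * (*ldc)] = (*alpha) * sum + (*beta) * c[i + j * (*ldc)];
        }
    }
}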
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

/* Prototypes for the BLACS / ScaLAPACK routines used below
   (they may instead come from a vendor header, depending on the installation). */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, char *layout, int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void new_pdgemm(char *transa, char *transb, int *m, int *n, int *k,
                       double *alpha, double *a, int *ia, int *ja, int *desca,
                       double *b, int *ib, int *jb, int *descb,
                       double *beta, double *c, int *ic, int *jc, int *descc);

int main(int argc, char **argv) {
    int i, j, k;

    /************ MPI ***************************/
    int myrank_mpi, nprocs_mpi;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank_mpi);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs_mpi);

    /************ BLACS ***************************/
    int ictxt, nprow, npcol, myrow, mycol, nb;
    int info, itemp;
    int _ZERO = 0, _ONE = 1;
    int M = 20000;                     /* global sizes: C(MxN) = A(MxK) * B(KxN) */
    int K = 20000;
    int N = 20000;
    nprow = 2; npcol = 2;              /* 2 x 2 process grid (4 MPI ranks) */
    nb = 1200;                         /* block size of the 2D block-cyclic distribution */

    Cblacs_pinfo(&myrank_mpi, &nprocs_mpi);
    Cblacs_get(-1, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);
    //printf("myrank = %d\n", myrank_mpi);

    /* Local sizes of the distributed blocks owned by this process */
    int rA = numroc_(&M, &nb, &myrow, &_ZERO, &nprow);
    int cA = numroc_(&K, &nb, &mycol, &_ZERO, &npcol);
    int rB = numroc_(&K, &nb, &myrow, &_ZERO, &nprow);
    int cB = numroc_(&N, &nb, &mycol, &_ZERO, &npcol);
    int rC = numroc_(&M, &nb, &myrow, &_ZERO, &nprow);
    int cC = numroc_(&N, &nb, &mycol, &_ZERO, &npcol);

    double *A = (double *) malloc(rA * cA * sizeof(double));
    double *B = (double *) malloc(rB * cB * sizeof(double));
    double *C = (double *) malloc(rC * cC * sizeof(double));

    /* Array descriptors; the local leading dimension is the local row count */
    int descA[9], descB[9], descC[9];
    descinit_(descA, &M, &K, &nb, &nb, &_ZERO, &_ZERO, &ictxt, &rA, &info);
    descinit_(descB, &K, &N, &nb, &nb, &_ZERO, &_ZERO, &ictxt, &rB, &info);
    descinit_(descC, &M, &N, &nb, &nb, &_ZERO, &_ZERO, &ictxt, &rC, &info);

    double alpha = 1.0; double beta = 1.0;
    double start, end, flops;

    /* Fill the local parts of A, B, C with random values in [-0.5, 0.5] */
    srand(time(NULL) * myrow + mycol);
    #pragma simd
    for (j = 0; j < rA * cA; j++) {
        A[j] = ((double) rand() - (double) (RAND_MAX) * 0.5) / (double) (RAND_MAX);
        // printf("A in myrank: %d\n", myrank_mpi);
    }
    // printf("A: %d\n", myrank_mpi);
    #pragma simd
    for (j = 0; j < rB * cB; j++) {
        B[j] = ((double) rand() - (double) (RAND_MAX) * 0.5) / (double) (RAND_MAX);
    }
    #pragma simd
    for (j = 0; j < rC * cC; j++) {
        C[j] = ((double) rand() - (double) (RAND_MAX) * 0.5) / (double) (RAND_MAX);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    new_pdgemm("N", "N", &M, &N, &K, &alpha, A, &_ONE, &_ONE, descA,
               B, &_ONE, &_ONE, descB, &beta, C, &_ONE, &_ONE, descC);
    MPI_Barrier(MPI_COMM_WORLD);
    end = MPI_Wtime();

    if (myrow == 0 && mycol == 0) {
        flops = 2 * (double) M * (double) N * (double) K / (end - start) / 1e9;
        /* printf("This is value: %d\t%d\t%d\t%d\t%d\t%d\t\n", rA, cA, rB, cB, rC, cC);
           printf("%f\t%f\t%f\n", A[4], B[6], C[3]); */
        printf("%f Gflops\n", flops);
    }

    free(A);
    free(B);
    free(C);
    Cblacs_gridexit(ictxt);   /* was Cblacs_gridexit(0): release the grid by its context handle */
    MPI_Finalize();
    return 0;
}
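For completeness, this is roughly how I build and run it; the exact link line depends on how ScaLAPACK/BLAS are installed on the cluster, so treat the library names below as an assumption:

# build: new_pdgemm.c is my modified ScaLAPACK routine (it may instead live inside a rebuilt libscalapack),
# mydgemm.c is my OpenMP kernel
mpicc -O2 -fopenmp main.c mydgemm.c new_pdgemm.c -o test_pdgemm \
      -lscalapack -llapack -lblas -lgfortran -lm

# run on the 2 x 2 process grid (4 MPI ranks)
mpirun -np 4 ./test_pdgemm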