2 votes

I'd like to copy an (n x n) matrix, which is distributed over a (p x q) grid of processes, to all processes, so that each process holds the whole (n x n) matrix, similar to MPI's allgather operation.

I understand that ScaLAPACK's pdgemr2d routine is the way to go, but the examples and documentation did not help me figure it out. My idea was to introduce a second BLACS context consisting of only one process, which is also the MPI root. pdgemr2d copies all the data to this 1x1 grid, and the root then broadcasts it to all other processes.

I am using the Fortran interface of ScaLAPACK/BLACS.
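To make this concrete, here is a rough sketch of the context setup I have in mind (the grid dimensions, variable names, and clean-up calls are just illustrative):

```fortran
program gather_sketch
  use mpi
  implicit none
  integer :: iam, nprocs, ierr
  integer :: ictxt_grid, ictxt_root      ! the p x q context and the 1 x 1 context
  integer :: nprow, npcol, myrow, mycol

  call mpi_init( ierr )
  call blacs_pinfo( iam, nprocs )

  ! Context 1: the normal p x q grid over all processes.
  nprow = 2                              ! illustrative choice of p; q = nprocs / p
  npcol = nprocs / nprow
  call blacs_get( -1, 0, ictxt_grid )
  call blacs_gridinit( ictxt_grid, 'Row-major', nprow, npcol )
  call blacs_gridinfo( ictxt_grid, nprow, npcol, myrow, mycol )

  ! Context 2: a 1 x 1 grid. Everyone makes this call, but only process 0
  ! becomes a member; all others get myrow = mycol = -1 from gridinfo.
  call blacs_get( -1, 0, ictxt_root )
  call blacs_gridinit( ictxt_root, 'Row-major', 1, 1 )

  ! ... distribute the matrix, copy, broadcast (see the sketch further down) ...

  call blacs_gridexit( ictxt_grid )
  if ( iam == 0 ) call blacs_gridexit( ictxt_root )
  call blacs_exit( 1 )                   ! 1 = do not finalize MPI for us
  call mpi_finalize( ierr )
end program gather_sketch
```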

Here are my questions:

  1. Is my idea stated above sane, or is there a (canonical) way with better performance?
  2. There are a lot of contexts in this context, and I am not sure I separate them correctly: all of my p x q processes are in MPI_COMM_WORLD, and this communicator is also used as the BLACS context for the grid. The root is then part of MPI_COMM_WORLD, the grid context, and the 1x1 context. It holds a chunk of data that somehow has to be sent from the p x q context to the 1x1 context. Is this correct, and does this even work?
  3. The last argument of pdgemr2d is ictxt, which must be a context containing the union of all participating processes. Is this MPI_COMM_WORLD?
  4. Do I need different calls for the members of the p x q grid and for the single member of the 1x1 grid? And if so, what is the difference? (See the sketch after this list for what I would try.)
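For reference, this is roughly how I picture the copy itself, continuing the sketch above. n, nb, and the array names are made up, and the rule that non-members pass a descriptor whose context entry is -1 is my reading of the pdgemr2d documentation, so treat it as an assumption:

```fortran
! (the declarations below belong up in the declaration section of the program above)
integer, parameter :: n = 1000, nb = 64
integer :: desc_dist(9), desc_full(9), info, locr, locc
double precision, allocatable :: a_dist(:,:), a_full(:,:)
integer, external :: numroc

! Descriptor of the block-cyclically distributed matrix on the p x q grid.
locr = numroc( n, nb, myrow, 0, nprow )
locc = numroc( n, nb, mycol, 0, npcol )
allocate( a_dist(max(1,locr), max(1,locc)), a_full(n,n) )
call descinit( desc_dist, n, n, nb, nb, 0, 0, ictxt_grid, max(1,locr), info )

! Descriptor of the full matrix on the 1 x 1 grid. My reading of the docs:
! processes outside that grid must set the context entry, desc(2), to -1.
if ( iam == 0 ) then
   call descinit( desc_full, n, n, n, n, 0, 0, ictxt_root, n, info )
else
   desc_full    = 0
   desc_full(2) = -1                 ! CTXT_ entry: "not a member of this grid"
end if

! All processes call pdgemr2d; the last argument must be a context that
! contains every participating process (here the p x q context qualifies).
call pdgemr2d( n, n, a_dist, 1, 1, desc_dist, &
               a_full, 1, 1, desc_full, ictxt_grid )

! Process 0 now holds the whole matrix; an ordinary MPI broadcast
! distributes it to everyone else.
call mpi_bcast( a_full, n*n, mpi_double_precision, 0, mpi_comm_world, ierr )
```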
Comments:

- Code is probably not strictly necessary, but it could make your points much clearer if you have some. – Vladimir F
- Seems I figured everything out by myself. Will provide an answer (with code)! – chris
- Seems I did not get it to work; see the follow-up question here: link – chris

1 Answer

1 vote

Check out this tutorial, which I found super useful when just starting out with ScaLAPACK: https://www.sharcnet.ca/help/index.php/LAPACK_and_ScaLAPACK_Examples

Also, you will eventually run into the 32-bit integer problem when using pdgemr2d on matrices with more than 2^31 elements: it crashes with the warning "xxmr2d: out of memory". That is due to a global array index being declared as a C int, which overflows once the array grows past 2^31 elements. The fix is to replace pdgemr2d with your own scatter and gather routines that honor the block-cyclic matrix distribution used by ScaLAPACK. I wrote my own Fortran code based on a C example I found online. So far I've tested it with ScaLAPACK dense matrix multiplication (pdsyrk, a symmetric rank-k update) on a 100,000 x 100,000 matrix, and it worked fine: about 520 s on 320 cores connected with QDR InfiniBand.
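To illustrate the idea, here is a minimal sketch of the local-to-global index mapping such a routine has to honor. This is not the code I actually use, just the core of it; the allreduce-based combination is a simplification, and as the comment notes, its 32-bit count runs into the very same limit for huge matrices:

```fortran
! Sketch: gather a block-cyclically distributed matrix to every process
! by mapping local entries to their global positions and summing copies.
subroutine gather_block_cyclic( n, a_dist, lld, desc_dist, a_full )
  use mpi
  implicit none
  integer, intent(in)           :: n, lld, desc_dist(9)
  double precision, intent(in)  :: a_dist(lld,*)   ! local part, leading dim lld
  double precision, intent(out) :: a_full(n,n)     ! full copy on every process
  integer :: nprow, npcol, myrow, mycol, locr, locc, il, jl, ig, jg, ierr
  integer, external :: numroc, indxl2g

  ! Grid and local extents from the descriptor: context = desc(2),
  ! mb = desc(5), nb = desc(6), first source row/col = desc(7)/desc(8).
  call blacs_gridinfo( desc_dist(2), nprow, npcol, myrow, mycol )
  locr = numroc( n, desc_dist(5), myrow, desc_dist(7), nprow )
  locc = numroc( n, desc_dist(6), mycol, desc_dist(8), npcol )

  ! Scatter the local entries to their global positions; the rest stays 0.
  a_full = 0.0d0
  do jl = 1, locc
     jg = indxl2g( jl, desc_dist(6), mycol, desc_dist(8), npcol )
     do il = 1, locr
        ig = indxl2g( il, desc_dist(5), myrow, desc_dist(7), nprow )
        a_full(ig, jg) = a_dist(il, jl)
     end do
  end do

  ! Every global entry is owned by exactly one process, so summing the
  ! partially filled copies yields the complete matrix everywhere.
  ! Caveat: this single call uses a 32-bit count, so for n*n > 2^31 - 1
  ! it has to be chunked (e.g. column block by column block).
  call mpi_allreduce( mpi_in_place, a_full, n*n, mpi_double_precision, &
                      mpi_sum, mpi_comm_world, ierr )
end subroutine gather_block_cyclic
```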

-Kerry