1
votes

I'm doing research on GPUs in cluster environments, using MPI for communication.
To compare speedup, I'm thinking of creating:

A matrix multiplication for GPU only: OK.
A matrix multiplication for CPU only: OK.
But I can't find a good implementation of CUDA + MPI matrix multiplication.

Does anyone have a hint about where I can find one, or can anyone suggest an implementation?

3
My environment with mpich2 is ready to use, so I'd prefer that over openmp. - Custodio

3 Answers

1
votes

The MTL4 Matrix Template Library can be a great starting point. Right now MTL4 supports multi-core and DMM, and we are almost done with a full GPU implementation. Peter and I have been talking about distributed GPU algorithms, but since our focus is driven by PDE solvers for the moment, distributed GPU algorithms are difficult to make competitive against robust DMM.

However, I am working on a new geophysics/medical imaging solver set that is more conducive to distributed GPU computing, as the data sets are more modest and the video capabilities of the GPU are beneficial.

To get started, take a look at the MTL4 tutorial.

1
votes

There is not much around. Your best bet is to write a block matrix multiplication over MPI and have each node do its block multiplication locally on the GPU.
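To make the idea concrete, here is a minimal sketch of the decomposition only, written as plain C++ so it runs without MPI or CUDA installed. It assumes a simple row-block split of A with B replicated on every rank. The names `local_gemm` and `block_row_matmul` are hypothetical helpers, the loop over `r` stands in for separate MPI ranks, and the comments mark where `MPI_Scatter`, `MPI_Bcast`, `MPI_Gather`, and a GPU GEMM such as `cublasSgemm` would go in a real program.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix stored in a flat vector.
using Matrix = std::vector<float>;

// Local multiply: C_block (rows x n) = A_block (rows x k) * B (k x n).
// In a CUDA + MPI program this is the step each rank would hand to its GPU
// (for example cublasSgemm or a custom kernel). A CPU loop stands in here.
Matrix local_gemm(const Matrix& a_block, const Matrix& b,
                  std::size_t rows, std::size_t k, std::size_t n) {
    Matrix c_block(rows * n, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float a = a_block[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                c_block[i * n + j] += a * b[p * n + j];
            }
        }
    }
    return c_block;
}

// Hypothetical helper: simulates `ranks` processes inside one loop.
// A is m x k, B is k x n, and m is assumed to be divisible by `ranks`.
Matrix block_row_matmul(const Matrix& a, const Matrix& b,
                        std::size_t m, std::size_t k, std::size_t n,
                        std::size_t ranks) {
    assert(ranks > 0 && m % ranks == 0);
    const std::size_t rows = m / ranks;  // rows of A owned by each rank
    Matrix c(m * n, 0.0f);
    for (std::size_t r = 0; r < ranks; ++r) {
        // "Scatter": rank r gets its contiguous block of rows of A
        // (MPI_Scatter in a real program). B would reach every rank
        // through MPI_Bcast.
        Matrix a_block(a.begin() + r * rows * k,
                       a.begin() + (r + 1) * rows * k);

        // Local block product; this is where the GPU would do the work.
        Matrix c_block = local_gemm(a_block, b, rows, k, n);

        // "Gather": the root collects each rank's rows of C (MPI_Gather).
        std::copy(c_block.begin(), c_block.end(), c.begin() + r * rows * n);
    }
    return c;
}
```

Calling `block_row_matmul(a, b, m, k, n, ranks)` with any `ranks` that divides `m` should give the same result as `ranks = 1`, which is a handy correctness check before swapping in real MPI calls and a GPU kernel. A row-block split is only one possible layout; it keeps communication simple at the cost of replicating B on every node.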

0
votes

The Combinatorial BLAS is a templated C++ MPI code that has a sparse matrix-matrix multiply operation. It uses a sqrt(p)-by-sqrt(p) processor grid and the SUMMA algorithm for matrix multiplication. One of the template arguments is a "sequential" component which is the matrix local to one process. You may be able to use it directly with a finagled template argument that's your CUDA structure, but at least it can serve as a reference for your own code.
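For reference, the SUMMA pattern on a q-by-q grid (q = sqrt(p)) can be sketched as below. This is not Combinatorial BLAS code and does not use its API: it is a dense, single-process simulation in plain C++, and `multiply_add` and `summa_simulated` are hypothetical names. The grid loops stand in for the MPI processes, and the comments mark where the row and column broadcasts (for example `MPI_Bcast` on row and column communicators) and the local, possibly GPU, multiply would happen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Block = std::vector<double>;  // one bs x bs block, row-major

// c += a * b for bs x bs blocks. This plays the role of the local
// "sequential" multiply, which is the part a CUDA kernel could replace.
void multiply_add(const Block& a, const Block& b, Block& c, std::size_t bs) {
    for (std::size_t i = 0; i < bs; ++i)
        for (std::size_t p = 0; p < bs; ++p)
            for (std::size_t j = 0; j < bs; ++j)
                c[i * bs + j] += a[i * bs + p] * b[p * bs + j];
}

// Simulated SUMMA for n x n row-major matrices on a q x q grid.
// Assumes n is divisible by q; process (i, j) owns block (i, j) of A, B, C.
std::vector<double> summa_simulated(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    std::size_t n, std::size_t q) {
    assert(q > 0 && n % q == 0);
    const std::size_t bs = n / q;

    // Copy block (bi, bj) out of a full matrix.
    auto get_block = [&](const std::vector<double>& m,
                         std::size_t bi, std::size_t bj) {
        Block blk(bs * bs);
        for (std::size_t i = 0; i < bs; ++i)
            for (std::size_t j = 0; j < bs; ++j)
                blk[i * bs + j] = m[(bi * bs + i) * n + bj * bs + j];
        return blk;
    };

    // One local C block per simulated process, initialised to zero.
    std::vector<Block> c_local(q * q, Block(bs * bs, 0.0));

    for (std::size_t k = 0; k < q; ++k) {          // SUMMA step k
        for (std::size_t i = 0; i < q; ++i) {
            // Process (i, k) broadcasts A(i, k) along grid row i.
            const Block a_ik = get_block(a, i, k);
            for (std::size_t j = 0; j < q; ++j) {
                // Process (k, j) broadcasts B(k, j) along grid column j.
                const Block b_kj = get_block(b, k, j);
                // Process (i, j) accumulates its local product.
                multiply_add(a_ik, b_kj, c_local[i * q + j], bs);
            }
        }
    }

    // Assemble the distributed C blocks into one matrix for inspection.
    std::vector<double> c(n * n, 0.0);
    for (std::size_t bi = 0; bi < q; ++bi)
        for (std::size_t bj = 0; bj < q; ++bj)
            for (std::size_t i = 0; i < bs; ++i)
                for (std::size_t j = 0; j < bs; ++j)
                    c[(bi * bs + i) * n + bj * bs + j] =
                        c_local[bi * q + bj][i * bs + j];
    return c;
}
```

Running `summa_simulated(a, b, n, q)` for several values of `q` that divide `n` and comparing the results is a quick way to check the block bookkeeping. In a real distributed version only the blocks move between processes, so each process needs storage for its own three blocks plus the two blocks received in the current step.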