I'm doing research on GPUs in cluster environments, using MPI for communication.
To compare speedup, I'm thinking of creating:
A GPU-only matrix multiplication, OK.
A CPU-only matrix multiplication, OK.
But I can't find a nice implementation of CUDA + MPI matrix multiplication.
Does anyone have a hint about where I can find one? Or can you suggest an implementation?