
I'm trying matrix multiplication using MPI and would like some help understanding one issue. The machine has 6 cores, 32 KB L1 cache, 256 KB L2 cache, and 15 MB L3 cache. The multiplication goes like this:

#include <vector>
using std::vector;

// Naive i-k-j triple loop; both matrices are passed by value (copied).
vector<vector<double>> mult_mpi(vector<vector<double>> m,
                                vector<vector<double>> n) {
    int rows = m.size();
    int size = n.size();
    vector<vector<double>> r(rows, vector<double>(size));

    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < size; ++k)
            for (int j = 0; j < size; ++j)
                r[i][j] += m[i][k] * n[k][j];
    return r;
}

I have the same for int:

vector<vector<int>> mult_mpi(vector<vector<int>> m, vector<vector<int>> n);

Then I made some plots; different line colors indicate the number of nodes.

The following plot shows time spent to multiply two int matrices:

[plot: time to multiply two int matrices, one line per node count]

And the following plot shows time spent to multiply two double matrices:

[plot: time to multiply two double matrices, one line per node count]

Why do I get the same times for 4 and 6 nodes in the double case? Am I running into the memory bandwidth limit?

I tried multiple times in the last hour, same result. I also checked the machine load with top, but as far as I can tell I'm alone there.

I don't know the answer to your question, but just so you're aware, this is a slow implementation of matrix multiplication. - harold
Would you suggest an alternative? Since I'm just starting, the simpler the better, I guess. Would you say the time I'm getting is compatible with the hardware? - KcFnMi
Unfortunately everything that makes matrix multiplication fast also makes it more complicated. For reference, on a Sandy Bridge E (I guess you use an i7-3960X?) your DP FPop/cycle ceiling is 4 multiplies and 4 adds. You can work out a minimum time from that based on clock speed (which depends on how many cores are active), core count, and matrix size, and then see how close you are to it. - harold
I also suggest Eigen, a really easy to use library for matrix algebra. - bracco23
As a comparison, on my Haswell (4770K, 3.9 GHz) multiplying two 4096x4096 matrices together on a single core takes about 1.5 seconds, and it's not optimal yet. Note that MMM is rather unique in that it does not need to be limited by memory bandwidth, since there is a cubic amount of arithmetic and only a square amount of data - but it will be limited by memory bandwidth if you don't implement tiling. - harold
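
To make that ceiling concrete, here is a small sketch of the arithmetic. The clock speed, core count, and 8 DP flops/cycle are assumptions taken from the i7-3960X guess in the comments, not measured values:

#include <cstdio>

int main() {
    // Assumed machine parameters (hypothetical, based on the i7-3960X guess above).
    const double flops_per_cycle = 8.0;    // 4 DP multiplies + 4 DP adds per cycle (AVX)
    const double clock_hz        = 3.3e9;  // assumed all-core clock
    const int    cores           = 6;
    const double N               = 4096;   // matrix dimension

    const double total_flops = 2.0 * N * N * N;    // one multiply + one add per inner-loop term
    const double peak_flops  = flops_per_cycle * clock_hz * cores;
    std::printf("lower bound: %.2f s\n", total_flops / peak_flops);  // ~0.87 s with these numbers
    return 0;
}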

1 Answer


Are you sure that you aren't timing the allocation of 4K vector<>'s...?

vector<vector<>> is not a suitable type if you want to squeeze out optimal performance. Matrix multiplication is one of the best operations in terms of scalability and "computation density" with respect to memory accesses: the number of operations scales as O(N^3), while the amount of data scales only as O(N^2).
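
As an illustration of the storage issue, here is a minimal sketch of a dense matrix kept in one contiguous row-major buffer (the Matrix name and its layout are just assumptions for this example, not a library type): one allocation instead of one per row, and the elements of a row are guaranteed to be adjacent in memory.

#include <vector>
#include <cstddef>

// Dense row-major matrix stored in a single contiguous buffer.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;   // rows * cols elements, row-major

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    double&       operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};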

In fact it's used to benchmark the TOP500 list of the fastest systems on Earth: HPL stands for "High Performance Linpack", LINPACK being a reference implementation for linear algebra. Guess what... the operation used in the benchmark is DGEMM, that is, "Double precision GEneral Matrix-Matrix multiply".

DGEMM is the name of the operation in the BLAS library, the de-facto standard interface for linear algebra. Today there are many locally optimized BLAS libraries, both commercial (Intel MKL, IBM ESSL, ...) and open source (ATLAS), but all of them use the same original (originally Fortran, now C too) BLAS interface. (Note: the original reference implementation is not optimized.)
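
For reference, a DGEMM call through the C (CBLAS) interface looks roughly like this; it computes C = alpha*A*B + beta*C and assumes you link against whichever optimized BLAS is installed (OpenBLAS, MKL, ATLAS, ...):

#include <cblas.h>   // CBLAS header shipped by the BLAS implementation
#include <vector>

int main() {
    const int N = 1024;                                 // square matrices for simplicity
    std::vector<double> A(N * N, 1.0), B(N * N, 2.0), C(N * N, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major storage, leading dimension N
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A.data(), N,
                     B.data(), N,
                0.0, C.data(), N);
    return 0;
}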

On top of BLAS there are the LAPACK libraries: linear system solvers, eigensolvers, ... There are also optimized LAPACK libraries, but usually 90% of the performance is already squeezed out by using an optimized BLAS library.

One powerful MPI-based parallel library that I know very well (not the only one... HPL is another) is ScaLAPACK, which contains PBLAS (parallel BLAS) and, within it, an optimized and parallel version of DGEMM, among other things.

ScaLAPACK comes with the SLUG (ScaLAPACK Users' Guide), where you can find an excellent explanation of the block-cyclic distribution, the data distribution strategy used to arrange linear algebra problems on parallel systems for optimal performance.
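
To give a feel for the idea (this is only a sketch of the mapping, not ScaLAPACK's actual API): the global matrix is cut into NB x NB blocks, and block (bi, bj) is assigned to process (bi mod Pr, bj mod Pc) of a Pr x Pc process grid, so every process owns pieces scattered over the whole matrix and the load stays balanced:

#include <cstdio>

// 2D block-cyclic mapping: which process owns global element (i, j)?
int owner_of_element(int i, int j, int NB, int Pr, int Pc) {
    int bi = i / NB;             // block row index
    int bj = j / NB;             // block column index
    int prow = bi % Pr;          // process row in the Pr x Pc grid
    int pcol = bj % Pc;          // process column
    return prow * Pc + pcol;     // linear rank, row-major process grid
}

int main() {
    // Example: 8x8 matrix, 2x2 blocks, 2x2 process grid.
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 8; ++j)
            std::printf("%d ", owner_of_element(i, j, 2, 2, 2));
        std::printf("\n");
    }
    return 0;
}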

To obtain optimal performance, however, you will need to link your MPI executable against a locally optimized BLAS library. Or write your own, but you are not alone, so don't reinvent the wheel.

Local optimization is obtained by accessing matrices not by row, nor by column, but by block, with the block size tuned to optimize the usage of the caches and/or of the TLB (libgoto comes to mind, another BLAS library that optimized for minimal TLB misses and, some time ago, reached and even surpassed Intel MKL on some systems). Find more information for example in this ATLAS paper.
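
A minimal sketch of that blocking idea applied to the naive triple loop, using a flat row-major layout; the block size of 64 is only a placeholder that would have to be tuned to the cache sizes of the machine:

#include <vector>
#include <cstddef>
#include <algorithm>

// Cache-blocked multiply of N x N row-major matrices: C += A * B.
// BS should be chosen so that three BS x BS tiles fit in cache together.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t N, std::size_t BS = 64) {
    for (std::size_t ii = 0; ii < N; ii += BS)
        for (std::size_t kk = 0; kk < N; kk += BS)
            for (std::size_t jj = 0; jj < N; jj += BS)
                // Multiply one BS x BS tile of A by one tile of B into a tile of C.
                for (std::size_t i = ii; i < std::min(ii + BS, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, N); ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}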

In any case, if you really want to write your own... I would start by analyzing how the other wheels have been forged, before trying to make mine ;)