0 votes

I am trying to write an efficient algorithm that can multiply large matrices of double-precision values. I wrote the algorithm and tested it on small matrices first; after trying it on larger ones, e.g. A{4096x4096}, B{4096x4096}, the loop takes forever to end: producing AB for those two matrices, for example, took my computer more than 30 minutes.

My computer is not an old slouch... it's a six-core i7, which I guess is not bad for a desktop workstation. On small matrices of sizes up to 1024x1024 it completes relatively quickly, that is under 30-40 seconds, and 2048x2048 takes around 5 minutes... but 16384x16384 didn't finish in 15 minutes and I stopped execution...

Am I doing something wrong, or is this to be expected? :)

Thanks in advance!

The code is the following:

/* calculate */
for (travx = 0; travx < m; travx++) {
    for (travy = 0; travy < n; travy++) {
        /* we only need to calculate it outside of the Z loop */
        tIndex = travy + travx * n;
        for (travz = 0; travz < p; travz++) {
            if (n == 1) {
                bIndex = (n - 1) * travy + travz;
                aIndex = p * travx + travz;
            } else {
                bIndex = n * travz + travy;
                aIndex = p * travx + travz;
            }

            temp = atab_ptr[aIndex] * btab_ptr[bIndex];
            outtab_ptr[tIndex] = outtab_ptr[tIndex] + temp;
        }
    }
}

It's really straightforward... and gives great results on small matrices... I have no idea how you can multiply doubles in under 10 seconds, especially on a P4... sounds kinda fishy... especially if you take into account the O(n^3) complexity of the problem.

Update: based on feedback I've tweaked the code. Mainly I simplified it, and small matrices complete much faster, that is 1024x1024 is done within 3 seconds, but 4096x4096 still takes 6 minutes... the revised code is this:

for (travx = 0; travx < m; travx++) {
    for (travy = 0; travy < n; travy++) {
        for (travz = 0; travz < p; travz++) {
            outtab_ptr[travy + travx * n] = outtab_ptr[travy + travx * n]
                + atab_ptr[travy + p * travz] * btab_ptr[travz + travx * p];
        }
    }
}
I can multiply two 4096² matrices in less than 10 s on one core of a Pentium @ 3.2 GHz, so yes, you might be doing something wrong. – Fred Foo
Your problem statement is really vague... which algorithm? How much time ("quickly" and "forever" are not numbers)? Single-threaded? – UmNyobe
You should post some code, or explain which library you are using (if any). – betabandido
I do not understand why you have a conditional inside the inner loop. That can only make things worse (especially since it has no clear purpose). FYI, out of curiosity, I did a quick test with the naive implementation and it takes 7 seconds to multiply 1024x1024 matrices. So your code is definitely not working very well... – betabandido
I just wrote a test in Python with NumPy which takes 10.3 seconds for the 4096x4096 case on a single core of a Core i7. So yes, you have big code-efficiency problems, and what @betabandido pointed out is probably the cause. There are a huge number of redundant IOPs in that code. – talonmies

3 Answers

4 votes

BLAS is the best way to go, if you can.

Having said that, fundamentally, matrix multiplication is limited by complexity, so you will have to be more intelligent to drive down times substantially. Are the matrices structured in any way? Are they tridiagonal or banded? Are they triangular or symmetric?

1 vote

Your "efficient" algorithm is actually quite inefficient. See what happens when n is not 1:

bIndex = ((n)*travz)+travy;
aIndex = ((p)*travx)+travz;
temp = atab_ptr[aIndex]*btab_ptr[bIndex];

The innermost loop is over travz, so aIndex increases by 1 on each increment of travz. bIndex, on the other hand, increases by a step of n, so you are accessing elements of btab_ptr that are not adjacent in memory and hence not in the same cache line.

Not to mention the effect that conditionals in the innermost loop have on possible vectorisation.

So your algorithm works acceptably fast as long as the data for all matrices fits in the L3 cache of the Core i7, but as soon as it does not, your performance drops drastically. This is then further multiplied by the O(N^3) complexity.

0 votes

Well, the naive approach to matrix multiplication is O(n^3). That means that the time it takes to multiply two matrices grows with the size of the input in a cubic manner. There are more efficient approaches. Here you can have a look:

http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra

Still, none of those approaches is below O(n^2). So it is normal that, as you increase the size of the matrices, the time to complete grows in a super-linear manner.

That being said, whether the time you are observing is excessive or not depends on many factors (your machine, your code, etc.).

By the way, you can have a look at this thread, where a very similar question is asked. And unless you are doing it for educational purposes, you'd be better off using an optimized library such as ATLAS.

Here you also have a classic document on how to optimize applications for better memory usage. In that document, the author uses several techniques, such as alignment and prefetching, in order to optimize the performance of matrix multiplication.