0 votes

I am trying to write an efficient algorithm that can multiply large matrices of double-precision values. I wrote the algorithm and tested it on small matrices first; after trying it on larger ones, e.g. A{4096x4096}, B{4096x4096}, the loop takes forever to end: producing AB for those two matrices, for example, took my computer more than 30 minutes.

My computer is not an old slouch... it's a six-core i7, which I guess is not bad for a desktop workstation. On small matrices of sizes up to 1024x1024 it completes relatively quickly, that is under 30-40 seconds, and 2048x2048 takes around 5 minutes... but 16384x16384 didn't finish in 15 minutes and I stopped execution...

Am I doing something wrong, or is this to be expected? :)

Thanks in advance!

The code is the following:

/* calculate */
for (travx = 0; travx < m; travx++) {
    for (travy = 0; travy < n; travy++) {
        /* we only need to calculate it outside of the Z loop */
        tIndex = travy + travx * n;
        for (travz = 0; travz < p; travz++) {
            if (n == 1) {
                bIndex = (n - 1) * travy + travz;
                aIndex = p * travx + travz;
            } else {
                bIndex = n * travz + travy;
                aIndex = p * travx + travz;
            }

            temp = atab_ptr[aIndex] * btab_ptr[bIndex];
            outtab_ptr[tIndex] = outtab_ptr[tIndex] + temp;
        }
    }
}

It's really straightforward... and gives great results on small matrices... I have no idea how you can multiply doubles in under 10 seconds, especially on a P4... sounds kinda fishy... especially if you take into account the O(n^3) complexity of the problem.

Update: based on feedback I've tweaked the code. Mainly I simplified it, and small matrices complete much faster, that is 1024x1024 is done within 3 seconds, but 4096x4096 still takes 6 minutes... the revised code is this:

for (travx = 0; travx < m; travx++) {
    for (travy = 0; travy < n; travy++) {
        for (travz = 0; travz < p; travz++) {
            outtab_ptr[travy + travx * n] = outtab_ptr[travy + travx * n]
                + atab_ptr[travy + p * travz] * btab_ptr[travz + travx * p];
        }
    }
}
I can multiply two 4096² matrices in less than 10 s on one core of a Pentium @ 3.2 GHz, so yes, you might be doing something wrong. – Fred Foo
Your problem statement is really vague... which algorithm? How much time ("quickly" and "forever" are not numbers)? Single-threaded? – UmNyobe
You should post some code, or explain which library you are using (if any). – betabandido
I do not understand why you have a conditional inside the inner loop. That can only make things worse (especially since it has no clear purpose). FYI, out of curiosity, I did a quick test with the naive implementation and it takes 7 seconds to multiply 1024x1024 matrices. So your code is definitely not working very well... – betabandido
I just wrote a test in Python with NumPy which takes 10.3 seconds for the 4096x4096 case on a single core of a Core i7. So yes, you have big code-efficiency problems, and what @betabandido pointed out is probably the cause. There are a huge number of redundant IOPs in that code. – talonmies

3 Answers

4 votes

BLAS is the best way to go, if you can.

Having said that, fundamentally, matrix multiplication is limited by complexity, so you will have to be more intelligent to drive down times substantially. Are the matrices structured in any way? Are they tridiagonal or banded? Are they triangular or symmetric?

1 vote

Your "efficient" algorithm is actually quite inefficient. See what happens when n is not 1:

bIndex = ((n)*travz)+travy;
aIndex = ((p)*travx)+travz;
temp = atab_ptr[aIndex]*btab_ptr[bIndex];

The innermost loop is over travz, so aIndex increases by 1 on each increment of travz. bIndex, on the other hand, increases by a step of n, so you are accessing elements of btab_ptr that are not adjacent in memory and hence not in the same cache line.

Not to mention the effect that conditionals in the innermost loop have on possible vectorisation.

So your algorithm works acceptably fast as long as the data for all matrices fits in the L3 cache of the Core i7, but as soon as it does not, your performance drops drastically. This is then further multiplied by the O(N^3) complexity.

0 votes

Well, the naive approach to matrix multiplication is O(n^3). That means that the time it takes to multiply two matrices grows with the size of the input in a cubic manner. There are more efficient approaches. Here you can have a look:

http://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra

Still, none of those approaches is below O(n^2). So it is normal that, as you increase the size of the matrices, the time to complete grows in a super-linear manner.

That being said, whether the time you are observing is excessive or not depends on many factors (your machine, your code, etc.).

By the way, you can have a look at this thread, where a very similar question is asked. And unless you are doing it for educational purposes, you'd be better off using an optimized library such as ATLAS.

Here you also have a classic document on how to optimize applications for better memory usage. In that document, the author uses several techniques, such as alignment and prefetching, in order to optimize the performance of matrix multiplication.