
I'm reading the book "Scientific Software Development with Fortran", and there is an exercise in it that I find very interesting:

"Create a Fortran module called MatrixMultiplyModule. Add three subroutines to it called LoopMatrixMultiply, IntrinsicMatrixMultiply, and MixMatrixMultiply. Each routine should take two real matrices as argument, perform a matrix multiplication, and return the result via a third argument. LoopMatrixMultiply should be written entirely with do loops, and no array operations or intrinsic procedures; IntrinsicMatrixMultiply should be written utilizing the matmul intrinsic function; and MixMatrixMultiply should be written using some do loops and the intrinsic function dot_product. Write a small program to test the performance of these three different ways of performing the matrix multiplication for different sizes of matrices."

I ran some tests multiplying two rank-2 matrices, and here are the results under different optimization flags:

[Three timing plots for the three routines at different optimization levels]

Compiler: ifort version 13.0.0 on Mac

Here is my question:

Why do they have about the same performance under -O0, but matmul gets a huge performance boost with -O3, while the explicit loops and dot_product get a much smaller one? Also, why does dot_product seem to have the same performance as the explicit do loops?

The code I use is the following:

module MatrixMultiplyModule

contains
    subroutine LoopMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j,k
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=0.

        do i=1,m
            do j=1,n
                do k=1,size(mtx1,dim=2)
                    mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                end do      
            end do
        end do
    end subroutine

    subroutine IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=matmul(mtx1,mtx2)

    end subroutine

    subroutine MixMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))

        do i=1,m
            do j=1,n
                    mtx3(i,j)=dot_product(mtx1(i,:),mtx2(:,j))
            end do
        end do

    end subroutine

end module




program main
use MatrixMultiplyModule
implicit none

real,allocatable        ::  a(:,:),b(:,:)
real,allocatable    ::  c1(:,:),c2(:,:),c3(:,:)
integer ::  n
integer ::  count, rate
real    ::  timeAtStart, timeAtEnd
real    ::  time(3,10)
do n=100,1000,100
    allocate(a(n,n),b(n,n))

    call random_number(a)
    call random_number(b)

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call LoopMatrixMultiply(a,b,c1)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(1,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call IntrinsicMatrixMultiply(a,b,c2)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(2,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call MixMatrixMultiply(a,b,c3)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(3,n/100)=timeAtEnd-timeAtStart


    deallocate(a,b)

end do

open(1,file="time.txt")
do n=1,10
    write(1,*) time(:,n)
end do
close(1)
deallocate(c1,c2,c3)
end program
Your program doesn't actually use the results of the matrix multiplication. Perhaps not applicable here, but be mindful that if the compiler is clever enough it will simply eliminate the calls to the various matrix routines entirely. What other command line options are you using? – IanH
@IanH No other flags were used. I only use the -O flag in these cases, for example ifort -O3 sub.f90 main.f90. – xslittlegrass
You can ensure the code is used by printing the resulting matrix to a scratch file outside of the timing calls. – weymouth
One reason this sort of exercise is included in that sort of book is to provoke you into thinking, and learning, more deeply about program performance: such topics as how you structure your code in a high-level language, how the compiler translates it into assembler, what different transformations the compiler makes for the various levels of optimisation, and much more besides. So what do you think? Cache, vector instructions, memory-access patterns, loop unrolling, array tiling, strictness of f-p operations: all sorts of factors come into play and are worth investigating. – High Performance Mark
@xslittlegrass my test shows that for matmul, -O3 is much faster than -O2: for a 1000x1000 random real matrix multiplication, -O3 takes 0.14 s while -O2 takes 0.48 s. – user15964
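Following the suggestion in the comments above, here is a minimal sketch (program and variable names are hypothetical, not from the question) of a timing harness that consumes the result outside the timed region, so the compiler cannot eliminate the multiplication as dead code:

```fortran
! Sketch: print a value derived from the result AFTER timing, so the
! optimizer must actually perform the multiplication.
program timing_sketch
    implicit none
    integer, parameter :: n = 200
    real, allocatable  :: a(:,:), b(:,:), c(:,:)
    integer            :: count0, count1, rate

    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)

    call system_clock(count0, rate)
    c = matmul(a, b)
    call system_clock(count1)

    ! Making c observable prevents the compiler from discarding the work.
    print *, "checksum:", sum(c)
    print *, "elapsed (s):", real(count1 - count0) / real(rate)
end program timing_sketch
```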

1 Answer


There are several things one should be aware of when looping over array elements:

  • Make sure the inner loop runs over consecutive elements in memory. In your current 'loop' algorithm, the inner loop is over index k. Since Fortran matrices are laid out in memory column by column (first index varying most rapidly when going through memory), accessing a new value of k may require loading a new cache line (or even a page). In this case, you can optimize your algorithm by reordering the loops as:

    do j=1,n
        do k=1,size(mtx1,dim=2)
            do i=1,m
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
            end do      
        end do
    end do
    

    Now the inner loop runs over consecutive elements in memory (the mtx2(k,j) value will probably be fetched only once before the inner loop by the compiler; if not, you can store it in a temporary variable before the loop).

  • Make sure the working set of the loops fits into the cache, in order to avoid too many cache misses. This can be done by blocking the algorithm. In this case, a solution could be e.g.:

    l=size(mtx1,dim=2)
    ichunk=512 ! I have a 3 MB cache size (real*4)
    do jj=1,n,ichunk
        do kk=1,l,ichunk
            do j=jj,min(jj+ichunk-1,n)
                do k=kk,min(kk+ichunk-1,l)
                    do i=1,m
                        mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                    end do
                end do
            end do
        end do
    end do


    in which case performance will depend on the size of ichunk, especially for large enough matrices (you could even block the inner loop as well; this is just an example).

  • Make sure the work needed to manage the loop is much smaller than the work inside the loop. This can be addressed by 'loop unrolling', i.e. combining several statements into one iteration of the loop. The compiler can usually do this for you when supplied the flag -funroll-loops.
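
Putting the first point together with the hoisted temporary mentioned there, here is a small self-checking sketch (the routine name ReorderedMatrixMultiply is made up for illustration):

```fortran
! Sketch of the reordered (j,k,i) loop order with mtx2(k,j) hoisted into a
! scalar, verified against the matmul intrinsic.
program reorder_demo
    implicit none
    real :: a(64,64), b(64,64), c(64,64)

    call random_number(a)
    call random_number(b)
    call ReorderedMatrixMultiply(a, b, c)
    if (maxval(abs(c - matmul(a, b))) < 1.0e-2) print *, "OK"
contains
    subroutine ReorderedMatrixMultiply(mtx1, mtx2, mtx3)
        real, intent(in)  :: mtx1(:,:), mtx2(:,:)
        real, intent(out) :: mtx3(:,:)
        integer :: i, j, k
        real    :: tmp
        mtx3 = 0.
        do j = 1, size(mtx2, dim=2)
            do k = 1, size(mtx1, dim=2)
                tmp = mtx2(k, j)            ! loop-invariant in i
                do i = 1, size(mtx1, dim=1) ! inner loop strides down a column
                    mtx3(i, j) = mtx3(i, j) + mtx1(i, k) * tmp
                end do
            end do
        end do
    end subroutine ReorderedMatrixMultiply
end program reorder_demo
```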

If I use the above code and compile with the flags -O3 -funroll-loops, I get slightly better performance than with matmul.

Of these three points, the most important to remember is the first one, about loop ordering, since it will affect performance in many other use cases and the compiler usually cannot fix it for you. Loop unrolling you can leave to the compiler (but test it, as it does not always improve performance). As for the second point, since it is hardware dependent, you shouldn't (generally) try to implement a highly efficient matrix multiplication yourself; consider instead using a library such as e.g. ATLAS, which can optimize for cache size, or a vendor library such as MKL or ACML.
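
For instance, a hedged sketch of calling the standard BLAS routine SGEMM (this requires linking against a BLAS implementation, e.g. with -lblas, OpenBLAS, or MKL; the program name is hypothetical):

```fortran
! Sketch: single-precision matrix multiply via BLAS, C := alpha*A*B + beta*C.
program sgemm_demo
    implicit none
    external :: sgemm
    integer, parameter :: m = 3, n = 4, k = 5
    real :: a(m,k), b(k,n), c(m,n)

    call random_number(a)
    call random_number(b)
    ! 'N','N': neither matrix transposed; the leading dimensions are the
    ! full column lengths m, k, m of a, b, c respectively.
    call sgemm('N', 'N', m, n, k, 1.0, a, m, b, k, 0.0, c, m)
    print *, "max deviation from matmul:", maxval(abs(c - matmul(a, b)))
end program sgemm_demo
```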