LAPACK/BLAS sgemm() slower than custom matrix multiplication

Question

I just ran a matrix * matrix multiplication once with LAPACK/BLAS and once with custom loop optimizations (tiling). I am a bit irritated because a simple loop tiling approach is approximately 43% faster than the BLAS algorithm. Basically, my question is whether I am making a mistake applying the BLAS routine. Here is my code:

program test 
  implicit none

  integer, parameter :: N = 1000, tile = 2
  real*4, dimension(N,N) :: a,b,c,temp 
  integer :: i,j,k,x,y,z
  double precision ::  E,S
  real :: alpha = 1.0, beta = 0.0

  call random_seed()
  call random_number(a)
  call random_number(b)

  call cpu_time(S)

  ! call sgemm('n','n',N, N, N, alpha,a,N,b,N, beta,c,N)

  do j = 1,N,tile
     do k = 1,N,tile
        do i = 1,N,tile
           do y = j, min( j+tile-1,N)
              do x = i, min( i+tile-1,N)
                 do z = k, min( k+tile-1,N)
                    c(x,y) = c(x,y) + a(x,z) * b(z,y)
                 enddo
              enddo
           enddo
        enddo
     enddo
  enddo

  call cpu_time(E)
  print*,(E-S)
end program test

I run this calculation on a Intel Dual Core2 machine with 4gb DRAM and 3096kb Cache. The program is compiled with:

$gfortran -O3 test.f03 -o test
0.9359

for the loop and:

$gfortran test.f03 -lblas -O3 -o test
1.3399

So am I not getting something about BLAS, am I missing something (compiler optimization, or well I just don't know what)? I ran a similar code with C++ with and without Eigen::Matrix and got considerable gain from using the Eigen library for MMM, which is why my expectation was similar high to the BLAS library.

ztik ztik · Accepted Answer · 2014-11-24T13:11:59

The BLAS routine is used correctly. The only difference is that BLAS is performing

C = 0.0*C + 1.0*A*B

and your loop

C = C + A*B

In your loop you are trying to improve usage of cpu's cache memory. There are variants of BLAS that perform similar actions. I suggest you to try openblas, atlas or mkl (intel compiler) libraries. You will get great time improvements.

LAPACK/BLAS sgemm() slower than custom matrix multiplication

1 Answers