I just ran a matrix * matrix multiplication once with LAPACK/BLAS and once with custom loop optimizations (tiling). I am a bit irritated because a simple loop tiling approach is approximately 43% faster than the BLAS algorithm. Basically, my question is whether I am making a mistake applying the BLAS routine. Here is my code:
program test
implicit none
integer, parameter :: N = 1000, tile = 2
real*4, dimension(N,N) :: a,b,c,temp
integer :: i,j,k,x,y,z
double precision :: E,S
real :: alpha = 1.0, beta = 0.0
call random_seed()
call random_number(a)
call random_number(b)
call cpu_time(S)
! call sgemm('n','n',N, N, N, alpha,a,N,b,N, beta,c,N)
do j = 1,N,tile
do k = 1,N,tile
do i = 1,N,tile
do y = j, min( j+tile-1,N)
do x = i, min( i+tile-1,N)
do z = k, min( k+tile-1,N)
c(x,y) = c(x,y) + a(x,z) * b(z,y)
enddo
enddo
enddo
enddo
enddo
enddo
call cpu_time(E)
print*,(E-S)
end program test
I run this calculation on a Intel Dual Core2 machine with 4gb DRAM and 3096kb Cache. The program is compiled with:
$gfortran -O3 test.f03 -o test
0.9359
for the loop and:
$gfortran test.f03 -lblas -O3 -o test
1.3399
So am I not getting something about BLAS, am I missing something (compiler optimization, or well I just don't know what)? I ran a similar code with C++ with and without Eigen::Matrix and got considerable gain from using the Eigen library for MMM, which is why my expectation was similar high to the BLAS library.