4
votes

I've written a trivial benchmark comparing matrix multiplication performance in three languages - Fortran (using Intel Parallel Studio 2015, compiling with the ifort switches: /O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel, this replaces MatMul calls with calls to the Intel MKL library), Python (using the current Anaconda version, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked with the Intel MKL library) and MATLAB R2015a (which, again, does matrix multiplication using the Intel MKL library).

Seeing as how all three implementations utilize the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices that are sufficiently large for function call overhead to become negligible. However, this is far from the case, while MATLAB and Python display virtually identical performance, Fortran beats both by a factor of 2-3x. I'd like to understand why.

Here is the code I've used for the Fortran version:

program MatMulTest

implicit none

integer, parameter :: N = 1024
integer :: i, j, cr, cm
real*8 :: t0, t1, rate
real*8 :: A(N,N), B(N,N), C(N,N)    

call random_seed()
call random_number(A)
call random_number(B)

! First initialize the system_clock
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = real(cr)
WRITE(*,*) "system_clock rate: ", rate

call cpu_time(t0)
do i = 1, 100, 1
    C=MatMul(A,B)                
end do
call cpu_time(t1)

write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)

end program MatMulTest

Do note that if your system clock rate is not 10000 as in my case, you need to modify the timing calculation accordingly to yield milliseconds.

The Python code:

import time
import numpy as np

def main(N):
    A = np.random.rand(N,N)
    B = np.random.rand(N,N)
    for i in range(100):
        C = np.dot(A,B)
    print C[0,0]

if __name__ == "__main__":
    N = 1024
    t0 = time.clock()
    main(N)
    t1 = time.clock()
    print "Time elapsed: " + str((t1-t0)*10) + " ms"

And, finally, the MATLAB snippet:

N=1024;
A=rand(N,N); B=rand(N,N);
tic;
for i=1:100
     C=A*B;
end
t=toc;
disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])

On my system, the results are as follows:

Fortran: 38.08 ms
Python: 104.29 ms
MATLAB: 97.36 ms

CPU use is indistinguishable in all three cases (using a steady 47-49% on an i7-920D0 processor w/ HT enabled for the duration of the calculation). Furthermore, the relative performance stays roughly equal for arbitrary matrix sizes with the exception that for very small matrices (N<80 or so) it is useful to manually disable parallelization in Fortran.

Is there any established reason for the discrepancy here? Am I doing something wrong? I would expect that at least for larger matrices Fortran would have no meaningful advantage in this case.

1
As commented: do not time the for, just time the multiplication!Ander Biguri
Why you inquire for the system_clock() rate, when you than measure the time with cpu_time()? Use one or the other, they have different purpose.Vladimir F
One of the reasons, as stupid as it might seem, could be the for loop. Loops are usually slower in python/matlab than C/Fortran. As @AnderBiguri suggested, try timming only the matrix multiplication, pick 100 timmings and average them in 3 languages.Imanol Luengo

1 Answers

6
votes

You have two issues here:

  1. In Python, you time the random initialisation as well as the computation, which you don't in Fortran and MATLAB
  2. In Fortran, you measure the CPU time while you measure the elapsed time in Python and MATLAB. And since you noticed that the CPU usage is around 46%, this might just account for the difference.

Just fix these two things and retry... You might consider using date_and_time() rather than cpu_time() for that purpose.