I've written a trivial benchmark comparing matrix multiplication performance in three languages: Fortran (using Intel Parallel Studio 2015, compiled with the ifort switches /O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel, which replaces MatMul calls with calls to the Intel MKL library), Python (using the current Anaconda version, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked with the Intel MKL library) and MATLAB R2015a (which, again, does matrix multiplication using the Intel MKL library).
Since all three implementations use the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices large enough for function call overhead to become negligible. However, this is far from the case: while MATLAB and Python display virtually identical performance, Fortran beats both by a factor of 2-3x. I'd like to understand why.
Here is the code I've used for the Fortran version:
program MatMulTest
    implicit none
    integer, parameter :: N = 1024
    integer :: i, j, cr, cm
    real*8 :: t0, t1, rate
    real*8 :: A(N,N), B(N,N), C(N,N)

    call random_seed()
    call random_number(A)
    call random_number(B)

    ! First initialize the system_clock
    CALL system_clock(count_rate=cr)
    CALL system_clock(count_max=cm)
    rate = real(cr)
    WRITE(*,*) "system_clock rate: ", rate

    call cpu_time(t0)
    do i = 1, 100, 1
        C=MatMul(A,B)
    end do
    call cpu_time(t1)

    write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
    write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)
end program MatMulTest
Do note that if your system clock rate is not 10000 as in my case, you need to modify the timing calculation accordingly to yield milliseconds.
The Python code:
import time
import numpy as np

def main(N):
    A = np.random.rand(N,N)
    B = np.random.rand(N,N)
    for i in range(100):
        C = np.dot(A,B)
    print C[0,0]

if __name__ == "__main__":
    N = 1024
    t0 = time.clock()
    main(N)
    t1 = time.clock()
    print "Time elapsed: " + str((t1-t0)*10) + " ms"
And, finally, the MATLAB snippet:
N=1024;
A=rand(N,N); B=rand(N,N);
tic;
for i=1:100
    C=A*B;
end
t=toc;
disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])
On my system, the results are as follows:
Fortran: 38.08 ms
Python: 104.29 ms
MATLAB: 97.36 ms
CPU use is indistinguishable in all three cases (using a steady 47-49% on an i7-920D0 processor w/ HT enabled for the duration of the calculation). Furthermore, the relative performance stays roughly equal for arbitrary matrix sizes with the exception that for very small matrices (N<80 or so) it is useful to manually disable parallelization in Fortran.
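As a rough plausibility check, the timings above can be converted to throughput, counting about 2*N^3 floating-point operations per N x N product. The sketch below is in Python 3. The peak figure in it is an assumption and not a measurement from this post: it supposes a nominal 2.66 GHz clock, 4 cores and 4 double-precision operations per core per cycle. The names gflops and assumed_peak are made up for illustration.

```python
def gflops(n, elapsed_ms):
    """Approximate throughput of one n x n matrix product, in GFLOP/s."""
    flops = 2.0 * n ** 3  # roughly n^3 multiplications plus n^3 additions
    return flops / (elapsed_ms / 1000.0) / 1e9


# Timings reported above (milliseconds per multiplication, N = 1024).
reported_ms = {"Fortran": 38.08, "Python": 104.29, "MATLAB": 97.36}

# ASSUMPTION (not from the post): nominal 2.66 GHz clock, 4 cores,
# 4 double-precision flops per core per cycle.
assumed_peak = 2.66 * 4 * 4  # GFLOP/s

for name, ms in reported_ms.items():
    rate = gflops(1024, ms)
    print("%-8s %6.1f GFLOP/s (%.0f%% of assumed peak)"
          % (name, rate, 100.0 * rate / assumed_peak))
```

Under these assumptions the Fortran figure comes out above the assumed peak while the other two sit well below it, which would point toward a difference in how the time is measured rather than a faster multiplication. If the real clock speed or per-cycle throughput differs, the percentages shift accordingly.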
Is there any established reason for the discrepancy here? Am I doing something wrong? I would expect that at least for larger matrices Fortran would have no meaningful advantage in this case.
Why do you ask for the system_clock() rate when you then measure the time with cpu_time()? Use one or the other; they have different purposes. – Vladimir F
You are also timing the for loop. Loops are usually slower in Python/MATLAB than in C/Fortran. As @AnderBiguri suggested, try timing only the matrix multiplication: take 100 timings and average them in all three languages. – Imanol Luengo
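The suggestion in the comments above, timing only the multiplication itself with a single wall-clock timer, could look roughly like the following. This is a minimal sketch in Python 3 (the question's code is Python 2), not the asker's code: time_calls, summarize and numpy_case are hypothetical helper names, and time.perf_counter is used as the wall-clock source.

```python
import time


def time_calls(fn, repeats=100):
    """Return per-call wall-clock times in milliseconds for fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Return (mean, minimum) of a list of timings."""
    return sum(samples) / len(samples), min(samples)


def numpy_case(n=1024):
    """Build a zero-argument callable that multiplies two n x n matrices."""
    import numpy as np  # local import: only this case needs NumPy
    a = np.random.rand(n, n)  # setup happens outside the timed region
    b = np.random.rand(n, n)
    return lambda: np.dot(a, b)
```

A possible use is `mean_ms, best_ms = summarize(time_calls(numpy_case(1024)))`. Matrix creation and loop overhead stay outside the timed region, and the timer measures elapsed wall-clock time rather than CPU time summed over threads, so the result is comparable to MATLAB's tic/toc. The Fortran version would need an equivalent wall-clock measurement for the comparison to be fair.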