Plain vs. allocatable/pointer arrays, Fortran advice?

Question

I wrote the following contrived matrix-multiplication example to examine how different array declarations affect performance. To my surprise, plain arrays whose sizes are known at declaration performed worse than both allocatable and pointer arrays. I thought allocatable was only needed for large arrays that don't fit into the stack. Here is the code, with timings from both the gfortran and Intel Fortran compilers on Windows 10, using the compiler flags -Ofast and -fast, respectively.

program matrix_multiply
   implicit none
   integer, parameter :: n = 1500
   real(8) :: a(n,n), b(n,n), c(n,n), aT(n,n)                 ! plain arrays 
   integer :: i, j, k, ts, te, count_rate, count_max
   real(8) :: tmp

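   ! To test allocatable arrays, comment out the plain-array declaration above
   ! and uncomment the two lines below:
   ! real(8), allocatable :: a(:,:), b(:,:), c(:,:), aT(:,:)  ! allocatable arrays
   ! allocate ( a(n,n), b(n,n), c(n,n), aT(n,n) )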

   do i = 1,n
      do j = 1,n
         a(i,j) = 1.d0/n/n * (i-j) * (i+j)
         b(i,j) = 1.d0/n/n * (i-j) * (i+j)
      end do 
   end do 

   ! transpose for cache-friendliness   
   do i = 1,n
      do j = 1,n
         aT(j,i) = a(i,j)
      end do 
   end do 

   call system_clock(ts, count_rate, count_max)
   do i = 1,n
      do j = 1,n
         tmp = 0 
         do k = 1,n
            tmp = tmp + aT(k,i) * b(k,j)
         end do
         c(i,j) = tmp
      end do
   end do
   call system_clock(te)
   print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate,', c_(n/2+1) = ', c(n/2+1,n/2+1)    
end program matrix_multiply

The timings are as follows:

! Intel Fortran
! -------------
Elapsed time: 1.546000, c_(n/2+1) = -143.8334 ! Plain Arrays
Elapsed time: 1.417000, c_(n/2+1) = -143.8334 ! Allocatable Arrays  

! gfortran:
! -------------
Elapsed time: 1.827999, c_(n/2+1) = -143.8334 ! Plain Arrays 
Elapsed time: 1.702999, c_(n/2+1) = -143.8334 ! Allocatable Arrays

My question is: why does this happen? Do allocatable arrays give the compiler more guarantees, so that it can optimize better? What is the best general advice for dealing with fixed-size arrays in Fortran?

At the risk of lengthening the question, here is another example where the Intel Fortran compiler exhibits the same behavior:

program testArrays
  implicit none
  integer, parameter :: m = 1223, n = 2015 
  real(8), parameter :: pi = acos(-1.d0)
  real(8) :: a(m,n)
  real(8), allocatable :: b(:,:)
  real(8), pointer :: c(:,:)
  integer :: i, sz = min(m, n), t0, t1, count_rate, count_max

  allocate( b(m,n), c(m,n) )
  call random_seed()
  call random_number(a)
  call random_number(b)
  call random_number(c)

  call system_clock(t0, count_rate, count_max)
    do i=1,1000
      call doit(a,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time plain: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( a(1:3,1:3) )

  call system_clock(t0)
    do i=1,1000
      call doit(b,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time alloc: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( b(1:3,1:3) )

  call system_clock(t0)
    do i=1,1000 
      call doitp(c,sz)
    end do 
  call system_clock(t1)
  print '(4g0)', 'Time p.ptr: ', real(t1-t0)/count_rate, ',  sum 3x3 = ', sum( c(1:3,1:3) )

  contains 
  subroutine doit(a,sz)
    real(8) :: a(:,:)
    integer :: sz 
    a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
  end

  subroutine doitp(a,sz)
    real(8), pointer :: a(:,:)
    integer :: sz
    a(1:sz,1:sz) = sin(2*pi*a(1:sz,1:sz))/(a(1:sz,1:sz)+1)
  end    
end program testArrays

ifort timings:

Time plain: 2.857000,  sum 3x3 = -.9913536
Time alloc: 2.750000,  sum 3x3 = .4471794
Time p.ptr: 2.786000,  sum 3x3 = 2.036269

The gfortran timings, however, are much longer but match my expectation:

Time plain: 51.5600014,  sum 3x3 = 6.2749456118192093
Time alloc: 54.0300007,  sum 3x3 = 6.4144775892064283
Time p.ptr: 54.1900034,  sum 3x3 = -2.1546109819149963

IanH · Accepted Answer · 2019-02-23T22:51:51

To get an idea whether the compiler thinks there is a difference, look at the generated assembly for the procedures. Based on a quick look here, the assembly for the timed section of the two cases for the first example appears to be more or less equivalent, in terms of the work that the processor has to do. This is as expected, because the arrays presented to the timed section are more or less equivalent - they are large, contiguous, not overlapping and with element values only known at runtime.

(Beyond the compiler, there can then be differences due to the way data presents in the various caches at runtime, but that should be similar for both cases as well.)

The main difference between explicit shape and allocatable arrays is the time it takes to allocate and deallocate the storage for the latter. There are at most four allocations in your first example (so the cost is unlikely to be onerous relative to the subsequent calculations), and you don't time that part of the program. Stick the allocation/implicit deallocation pair inside a loop, then see how you go.
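One way to try this is sketched below. It is only an illustration, not a careful benchmark: the program name alloc_cost, the subroutine work and the repeat count nrep are made up for this example, and an optimizing compiler may transform the loop in ways that change what is actually measured.

```fortran
program alloc_cost
   implicit none
   integer, parameter :: n = 1500
   integer, parameter :: nrep = 200      ! arbitrary repeat count (assumption)
   integer :: rep, ts, te, count_rate
   real(8) :: total

   total = 0
   call system_clock(ts, count_rate)
   do rep = 1, nrep
      call work(total)                   ! one allocate/deallocate pair per call
   end do
   call system_clock(te)
   print '(4G0)', "Elapsed time: ", real(te-ts)/count_rate, ", total = ", total

contains

   subroutine work(total)
      real(8), intent(inout) :: total
      real(8), allocatable :: a(:,:)

      allocate ( a(n,n) )                ! storage is obtained on every call
      a = 1.d0                           ! touch the whole array
      total = total + a(n/2+1,n/2+1)     ! use a result so the work is not trivially dead
      ! a is deallocated automatically when the subroutine returns
   end subroutine work

end program alloc_cost
```

For comparison, replace the allocatable declaration and the allocate statement in work with a plain `real(8) :: a(n,n)` and time the program again. Where the compiler places that local array (static storage or the stack) depends on the compiler and its options.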

Arrays with the pointer or target attribute may be subject to aliasing, so the compiler may have to do extra work to account for the possibility that the storage for the arrays overlaps. However, the nature of the expression in the second example (only one array is referenced) is such that the compiler likely knows there is no need for the extra work in this particular case, and the operations become equivalent again.
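A minimal sketch of a case where aliasing does matter, unlike the single-array expression in the question. The names alias_demo, add_plain and add_ptr are invented for this illustration. With ordinary dummy arguments the language rules let the compiler assume that x and y do not overlap when x is modified; with pointer dummies the two may refer to the same storage, so the compiler has to allow for overlap (for instance by evaluating the right-hand side into a temporary).

```fortran
module alias_demo
   implicit none
contains

   ! Ordinary dummy arguments: the caller must not pass overlapping
   ! storage for x and y, so the compiler may assume they are distinct.
   subroutine add_plain(x, y)
      real(8), intent(inout) :: x(:)
      real(8), intent(in)    :: y(:)
      x(2:) = x(2:) + y(:size(y)-1)
   end subroutine add_plain

   ! Pointer dummy arguments: x and y may legitimately point at the
   ! same target, so overlap has to be handled correctly.
   subroutine add_ptr(x, y)
      real(8), pointer, intent(in) :: x(:), y(:)
      x(2:) = x(2:) + y(:size(y)-1)
   end subroutine add_ptr

end module alias_demo

program alias_test
   use alias_demo
   implicit none
   real(8), target  :: t(5)
   real(8)          :: u(5), v(5)
   real(8), pointer :: p(:), q(:)

   ! Distinct arrays through the non-pointer interface.
   u = 1
   v = 2
   call add_plain(u, v)
   print *, u

   ! Both pointers refer to the same array: the right-hand side must
   ! behave as if fully evaluated before any element of t is stored.
   t = [1, 2, 3, 4, 5]
   p => t
   q => t
   call add_ptr(p, q)
   print *, t
end program alias_test
```

After the second call t should hold 1, 3, 5, 7, 9, because array assignment behaves as if the whole right-hand side is evaluated before anything is stored. Comparing the assembly generated for add_plain and add_ptr is one way to see whether a given compiler does extra work for the pointer version.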

In response to "I thought allocatable was only needed for large arrays that don't fit into the stack" - allocatable is needed (i.e. you have no real choice) when you cannot determine the size or other characteristics of the object being allocated in the specification part of the procedure responsible for the object's entire existence. Even for characteristics not known until runtime, if you can still determine them in the specification part of the relevant procedure, then automatic variables are an option. (There are no automatic variables in your example though - in the non-allocatable, non-pointer cases, all the characteristics of the arrays are known at compile time.) At the level of the Fortran processor implementation, which varies between compilers and compile options, automatic variables may require more stack space than is available, and this can cause problems that allocatables may alleviate (or you can just change compiler options).
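The distinction can be sketched as follows. The module and function names are hypothetical, and the second function is contrived: its size is computed by a loop purely to stand in for any size that only becomes known while the procedure body is executing.

```fortran
module work_arrays
   implicit none
contains

   ! Automatic variable: the size of w depends on the dummy argument n,
   ! so it is known in the specification part, although only at runtime.
   function sum_squares_auto(n) result(s)
      integer, intent(in) :: n
      real(8) :: s
      real(8) :: w(n)                    ! automatic array
      integer :: i
      do i = 1, n
         w(i) = real(i, 8)**2
      end do
      s = sum(w)
   end function sum_squares_auto

   ! Allocatable: here the size is only found by executing statements,
   ! so w cannot be declared with its final shape up front.
   function sum_squares_alloc(limit) result(s)
      real(8), intent(in) :: limit
      real(8) :: s
      real(8), allocatable :: w(:)
      integer :: n, i
      n = 0
      do while (real(n+1, 8)**2 <= limit)   ! count squares not exceeding limit
         n = n + 1
      end do
      allocate ( w(n) )
      do i = 1, n
         w(i) = real(i, 8)**2
      end do
      s = sum(w)
   end function sum_squares_alloc

end module work_arrays

program work_arrays_test
   use work_arrays
   implicit none
   print *, sum_squares_auto(4)          ! 1 + 4 + 9 + 16
   print *, sum_squares_alloc(20.d0)     ! squares up to 20: 1, 4, 9, 16
end program work_arrays_test
```

Both calls sum the same four squares and should give 30. Whether the automatic array w(n) lives on the stack or the heap is up to the compiler and its options, which is where the stack-size problems mentioned above can come from.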
