Suppose I have an n-dimensional array and want to sum over m of its dimensions, so that in the end I am left with an n-m-dimensional array (n>m). If I just want to sum an array over a single dimension, I can pass an scalar integer to the SUM-intrinsic, e.g. SUM(array, DIM = 2). However, if I want to sum over more than one dimension (but not all), I didn't find any information how to do that in a simple manner, so the only thing I figured out was to chain several SUM commands. As an example, let's assume that I would want to sum over two dimensions of a 3d array:
PROGRAM sum_test
IMPLICIT NONE
INTEGER :: a(2,2,2) = reshape( (/1, 2, 3, 4, 5, 6, 7, 8/), (/2,2,2/), order = (/ 1, 2, 3 /) )
INTEGER :: b(2,2)
INTEGER :: c(2)
INTEGER :: d(2)
d = SUM(SUM(a, DIM=3), DIM=2)
WRITE (*,*) d
END PROGRAM sum_test
This gives the expected result. However, if I would have more dimensions, this would possibly become quite hard to read. So is there a 'better' way how to achieve this? In python's numpy, for instance, one can pass a sequence of integers to the sum command.
EDIT:
To test the performance of the different solutions suggested by @francescalus, I wrote a little program and compiled with two different compilers on my laptop (mac with High Sierra, compilers are g95 and gfortran-mp-7) and on two linux machines that I have available. For each compiler I used -O2 as optimisation option.
As g95 does not support do concurrent, I left that version out. I was also interested how the combination of ranks that are being summed over affects performance, I summed a 4d array once over the second and fourth and once over the third and fourth rank. Here is the code:
PROGRAM sum_performance
IMPLICIT NONE
INTEGER, PARAMETER :: size = 100
REAL :: array(size, size, size, size)
REAL, DIMENSION(size, size) :: res11, res12, res13, res14
REAL, DIMENSION(size, size) :: res21, res22, res23, res24
INTEGER :: i,j,k,l
INTEGER :: clock_start, clock_stop, clock_rate
INTEGER :: count
INTEGER, PARAMETER :: repeat = 100
!getting clock rate:
CALL SYSTEM_CLOCK(count_rate = clock_rate)
!setting up array:
CALL RANDOM_NUMBER(array)
!summing second and fourth index
WRITE (*,*) 'second and fourth rank'
CALL SYSTEM_CLOCK(count = clock_start)
DO count = 1,repeat
res11 = SUM(SUM(array, DIM = 4), DIM = 2)
END DO
CALL SYSTEM_CLOCK(count = clock_stop)
WRITE(*,*) 'chained sum: ', (REAL(clock_stop - clock_start)/&
REAL(clock_rate))/REAL(repeat)
CALL SYSTEM_CLOCK(count = clock_start)
DO count = 1,repeat
DO l = 1,size
DO j = 1,size
res12(j,l) = SUM(array(:,j,:,l))
END DO
END DO
END DO
CALL SYSTEM_CLOCK(count = clock_stop)
WRITE(*,*) 'nested do: ', (REAL(clock_stop - clock_start)/&
REAL(clock_rate))/REAL(repeat)
CALL SYSTEM_CLOCK(count = clock_start)
DO count = 1,repeat
FORALL (j = 1:size, l = 1:size)
res13(j,l) = SUM(array(:,j,:,l))
END FORALL
END DO
CALL SYSTEM_CLOCK(count = clock_stop)
WRITE(*,*) 'forall: ', (REAL(clock_stop - clock_start)/&
REAL(clock_rate))/REAL(repeat)
!summing second and fourth index
WRITE (*,*)
WRITE (*,*) 'third and fourth rank'
CALL SYSTEM_CLOCK(count = clock_start)
DO count = 1,repeat
res21 = SUM(SUM(array, DIM = 4), DIM = 3)
END DO
CALL SYSTEM_CLOCK(count = clock_stop)
WRITE(*,*) 'chained sum: ', (REAL(clock_stop - clock_start)/&
REAL(clock_rate))/REAL(repeat)
CALL SYSTEM_CLOCK(count = clock_start)
DO count = 1,repeat
DO l = 1,size
DO k = 1,size
res22(k,l) = SUM(array(:,:,k,l))
END DO
END DO
END DO
CALL SYSTEM_CLOCK(count = clock_stop)
WRITE(*,*) 'nested do: ', (REAL(clock_stop - clock_start)/&
REAL(clock_rate))/REAL(repeat)
CALL SYSTEM_CLOCK(count = clock_start)
DO count = 1,repeat
FORALL (k = 1:size, l = 1:size)
res23(k,l) = SUM(array(:,:,k,l))
END FORALL
END DO
CALL SYSTEM_CLOCK(count = clock_stop)
WRITE(*,*) 'forall: ', (REAL(clock_stop - clock_start)/&
REAL(clock_rate))/REAL(repeat)
END PROGRAM
The results were surprising in quite a few ways. For mac g95 I get:
second and fourth rank
chained sum: 0.193214
nested do: 0.140472
forall: 0.18884899
third and fourth rank
chained sum: 0.196938
nested do: 0.114286005
forall: 0.115414
for mac gfortran-mp-7 I get
second and fourth rank
chained sum: 0.279830009
nested do: 0.131999999
forall: 0.130150005
third and fourth rank
chained sum: 3.01672006
nested do: 0.111460000
forall: 0.110610001
For the two linux machines the relative performances are similar to the mac g95 results, although absolute performances are different. As one can see, the chained SUM solution shows the worst performance (maybe do to a time penalty, because of a temporary array creation?). Furthermore something funny happens when summing over ranks 3 and 4 (this is a robust feature, I checked several times). I'm guessing that this has something to do with how the temporary arrays are created (if they are created). Funnily it only happens with one of the two compilers. The nested DO loops and the FORALL seem to be pretty much comparable, so I guess it is up to personal taste which one to prefer.