3
votes

Suppose I have an n-dimensional array and want to sum over m of its dimensions, so that in the end I am left with an n-m-dimensional array (n>m). If I just want to sum an array over a single dimension, I can pass an scalar integer to the SUM-intrinsic, e.g. SUM(array, DIM = 2). However, if I want to sum over more than one dimension (but not all), I didn't find any information how to do that in a simple manner, so the only thing I figured out was to chain several SUM commands. As an example, let's assume that I would want to sum over two dimensions of a 3d array:

PROGRAM sum_test

  IMPLICIT NONE

  INTEGER :: a(2,2,2) = reshape( (/1, 2, 3, 4, 5, 6, 7, 8/), (/2,2,2/), order = (/ 1, 2, 3 /) )
  INTEGER :: b(2,2)
  INTEGER :: c(2)
  INTEGER :: d(2)

  d = SUM(SUM(a, DIM=3), DIM=2)

  WRITE (*,*) d

END PROGRAM sum_test

This gives the expected result. However, if I would have more dimensions, this would possibly become quite hard to read. So is there a 'better' way how to achieve this? In python's numpy, for instance, one can pass a sequence of integers to the sum command.

EDIT:

To test the performance of the different solutions suggested by @francescalus, I wrote a little program and compiled with two different compilers on my laptop (mac with High Sierra, compilers are g95 and gfortran-mp-7) and on two linux machines that I have available. For each compiler I used -O2 as optimisation option.

As g95 does not support do concurrent, I left that version out. I was also interested how the combination of ranks that are being summed over affects performance, I summed a 4d array once over the second and fourth and once over the third and fourth rank. Here is the code:

PROGRAM sum_performance
  IMPLICIT NONE

  INTEGER, PARAMETER :: size = 100
  REAL :: array(size, size, size, size)
  REAL, DIMENSION(size, size) :: res11, res12, res13, res14
  REAL, DIMENSION(size, size) :: res21, res22, res23, res24

  INTEGER :: i,j,k,l
  INTEGER :: clock_start, clock_stop, clock_rate
  INTEGER :: count
  INTEGER, PARAMETER :: repeat = 100

  !getting clock rate:
  CALL SYSTEM_CLOCK(count_rate = clock_rate)

  !setting up array:
  CALL RANDOM_NUMBER(array)

  !summing second and fourth index
  WRITE (*,*) 'second and fourth rank'

  CALL SYSTEM_CLOCK(count = clock_start)
  DO count = 1,repeat
     res11 = SUM(SUM(array, DIM = 4), DIM = 2)
  END DO
  CALL SYSTEM_CLOCK(count = clock_stop)
  WRITE(*,*) 'chained sum:  ', (REAL(clock_stop - clock_start)/&
       REAL(clock_rate))/REAL(repeat)

  CALL SYSTEM_CLOCK(count = clock_start)
  DO count = 1,repeat
     DO l = 1,size
        DO j = 1,size
           res12(j,l) = SUM(array(:,j,:,l))
        END DO
     END DO
  END DO
  CALL SYSTEM_CLOCK(count = clock_stop)
  WRITE(*,*) 'nested do:    ', (REAL(clock_stop - clock_start)/&
       REAL(clock_rate))/REAL(repeat)

  CALL SYSTEM_CLOCK(count = clock_start)
  DO count = 1,repeat
     FORALL (j = 1:size, l = 1:size)
           res13(j,l) = SUM(array(:,j,:,l))
     END FORALL
  END DO
  CALL SYSTEM_CLOCK(count = clock_stop)
  WRITE(*,*) 'forall:       ', (REAL(clock_stop - clock_start)/&
       REAL(clock_rate))/REAL(repeat)


 !summing second and fourth index
  WRITE (*,*)
  WRITE (*,*) 'third and fourth rank'

  CALL SYSTEM_CLOCK(count = clock_start)
  DO count = 1,repeat
     res21 = SUM(SUM(array, DIM = 4), DIM = 3)
  END DO
  CALL SYSTEM_CLOCK(count = clock_stop)
  WRITE(*,*) 'chained sum:  ', (REAL(clock_stop - clock_start)/&
       REAL(clock_rate))/REAL(repeat)

  CALL SYSTEM_CLOCK(count = clock_start)
  DO count = 1,repeat
     DO l = 1,size
        DO k = 1,size
           res22(k,l) = SUM(array(:,:,k,l))
        END DO
     END DO
  END DO
  CALL SYSTEM_CLOCK(count = clock_stop)
  WRITE(*,*) 'nested do:    ', (REAL(clock_stop - clock_start)/&
       REAL(clock_rate))/REAL(repeat)

  CALL SYSTEM_CLOCK(count = clock_start)
  DO count = 1,repeat
     FORALL (k = 1:size, l = 1:size)
           res23(k,l) = SUM(array(:,:,k,l))
     END FORALL
  END DO
  CALL SYSTEM_CLOCK(count = clock_stop)
  WRITE(*,*) 'forall:       ', (REAL(clock_stop - clock_start)/&
       REAL(clock_rate))/REAL(repeat)

END PROGRAM

The results were surprising in quite a few ways. For mac g95 I get:

second and fourth rank
chained sum:   0.193214
nested do:     0.140472
forall:        0.18884899

third and fourth rank
chained sum:   0.196938
nested do:     0.114286005
forall:        0.115414

for mac gfortran-mp-7 I get

second and fourth rank
chained sum:    0.279830009    
nested do:      0.131999999    
forall:         0.130150005    

third and fourth rank
chained sum:     3.01672006    
nested do:      0.111460000    
forall:         0.110610001    

For the two linux machines the relative performances are similar to the mac g95 results, although absolute performances are different. As one can see, the chained SUM solution shows the worst performance (maybe do to a time penalty, because of a temporary array creation?). Furthermore something funny happens when summing over ranks 3 and 4 (this is a robust feature, I checked several times). I'm guessing that this has something to do with how the temporary arrays are created (if they are created). Funnily it only happens with one of the two compilers. The nested DO loops and the FORALL seem to be pretty much comparable, so I guess it is up to personal taste which one to prefer.

1

1 Answers

2
votes

I wouldn't underestimate how readable loop constructs can be in situations like this.

It'll be hard to sensibly give a general recipe, so I'll just give examples.

For the example of the question:

do i=1,SIZE(d)
  d(i) = SUM(a(i,:,:))
end do

With higher rank results, one could nest such do constructs, or use a forall construct or do concurrent

forall (i=1:SIZE(d,1), j=1:SIZE(d,2))
  d(i,j) = SUM(a(i,:,:,j)
end

or

do concurrent (i=1:SIZE(d,1), j=1:SIZE(d,2))
  d(i,j) = SUM(a(i,:,:,j)
end

Loops/foralls can be avoided with reshape if really required:

d = SUM(RESHAPE(a,[2,4]),dim=2)

but reordering on reshaping may well be required to get the reduction over the desired dimensions. This could easily become unreadable or unclear.