
I am in the middle of instrumenting a fairly large code with OpenACC. Right now, I am dealing with a routine foo that calls a few other routines, bar, far, and boo, like so:

subroutine foo

real x(100,25),y(100,25),z(100,25)
real barout(25), farout(25), booout(25)

do i=1,25
  call bar(barout, x(1,i),y(1,i),z(1,i))
  call far(farout, x(1,i),y(1,i),z(1,i))
  call boo(booout, x(1,i),y(1,i),z(1,i))
enddo

....

end subroutine foo

A couple of points: 1) x, y, and z stay constant through the loop. 2) You might not like the structure of the code here, but changing it is beyond my job description. I am supposed to instrument with OpenACC, period.

I am currently concentrating on the call to "bar". I want to make bar a vector routine and call it from within a parallel region, but I am not ready to do the same for far and boo. (I said this is a work in progress, right?) Now, I could -- I think! -- sandwich bar in its own parallel region and copy data to and from it in each loop iteration:

!$acc data copy(barout) &
!$acc&     copyin(x(:,:),y(:,:),z(:,:))
!$acc parallel
call bar( .... )
!$acc end parallel
!$acc end data

But that's a lot of data transfer. It would be great if I could transfer x, y, and z to the device just once. Each of the routines has its own data regions, so as I understand it (please correct me if I am wrong!) I cannot encase the entire loop in a single data region. Here is an alternative I tried:

subroutine foo
!$acc routine(bar) vector

real x(100,25),y(100,25),z(100,25)
real barout(25), farout(25), booout(25)

!$acc data create(x(:,:),y(:,:),z(:,:))
!$acc end data
do i=1,25
!$acc data copy(barout(:)) &
!$acc&     present(x(:,:),y(:,:),z(:,:))
!$acc parallel
  call bar(barout, x(1,i),y(1,i),z(1,i))
!$acc end parallel
!$acc end data
  call far(farout, x(1,i),y(1,i),z(1,i))
  call boo(booout, x(1,i),y(1,i),z(1,i))
enddo

....

end subroutine foo

But this doesn't work because the data from the first data region doesn't persist on the device. It is gone by the time the present clause is reached. (I've tried data create as well as data copyin.)

So is there a way to do what I am trying to do here? Thanks.

1 Answer

Have the outer data region span the "i" loop. As you have it, "end data" comes directly after the start of the region, so x, y, and z are deleted from the device before the "i" loop and are not present when the inner region looks for them. I'd also recommend using update directives within the loop to manage the data transfers. Something like:

subroutine foo
!$acc routine(bar) vector

real x(100,25),y(100,25),z(100,25)
real barout(25), farout(25), booout(25)

!$acc data copyin(x, y, z), create(barout)
do i=1,25
!$acc update device(barout)
!$acc parallel
  call bar(barout, x(1,i),y(1,i),z(1,i))
!$acc end parallel
!$acc update host(barout)
  call far(farout, x(1,i),y(1,i),z(1,i))
  call boo(booout, x(1,i),y(1,i),z(1,i))
enddo
!$acc end data
....

end subroutine foo

Notes:

Since "bar" is a vector routine, calling it from a "parallel" region means that you'll only be using a single gang. It's not wrong code, but you will lose performance. It might be better to keep it as a host routine and then put the "parallel" inside of "bar" so you can use both "gang" and "vector". Granted, if your intent is to later move the inner "parallel" region to a "parallel loop gang" around the "i" loop, then it would make sense to leave it as is.
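As a rough illustration of the "put the parallel inside bar" option, here is a minimal sketch. The routine name bar_host and the loop body are hypothetical placeholders, since the real computation in bar is not shown in the question. It assumes the caller is still inside the outer data region, so the columns of x, y, and z are already on the device.

```fortran
! Sketch only: bar_host and its loop body are hypothetical stand-ins
! for the real bar, whose computation is not shown in the question.
subroutine bar_host(barout, xcol, ycol, zcol)
  implicit none
  real, intent(inout) :: barout(25)
  real, intent(in)    :: xcol(100), ycol(100), zcol(100)
  integer :: j
  real :: s

  s = 0.0
  ! present() assumes the caller's data region already holds x, y, and z.
  ! The loop is spread across gangs and vector lanes; s is reduced over both.
  !$acc parallel loop gang vector reduction(+:s) present(xcol, ycol, zcol)
  do j = 1, 100
     s = s + xcol(j) * ycol(j) * zcol(j)
  enddo

  ! Placeholder: the reduced value is available on the host after the loop.
  barout(1) = s
end subroutine bar_host
```

With this layout, foo would call bar_host(barout, x(1,i), y(1,i), z(1,i)) directly from host code inside the data region, without wrapping the call in its own "parallel" construct and without a "routine" directive for it.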

I changed your code to copyin x, y, and z since I wasn't sure where these variables get initialized. If they are initialized in "bar", you can change these back to use "create", but then add update directives to synchronize the host and device copies.
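For the "create" plus update variant, a minimal self-contained sketch might look like the following. The initial values and the summation loop are made up for illustration, and the sketch assumes the arrays are filled on the host after the data region has started. If they were filled on the device instead, the matching directive would be "update host" before any host routine reads them.

```fortran
program create_update_sketch
  implicit none
  real :: x(100,25), y(100,25), z(100,25)
  real :: total
  integer :: i, j

  ! create() allocates device storage but copies nothing.
  !$acc data create(x, y, z)

  ! Hypothetical host-side initialization standing in for the real code.
  x = 1.0
  y = 2.0
  z = 3.0

  ! Push the freshly initialized host values to the device copies.
  !$acc update device(x, y, z)

  total = 0.0
  ! Placeholder device work that reads the arrays already on the device.
  !$acc parallel loop collapse(2) reduction(+:total) present(x, y, z)
  do i = 1, 25
     do j = 1, 100
        total = total + x(j,i) + y(j,i) + z(j,i)
     enddo
  enddo

  !$acc end data

  print *, total
end program create_update_sketch
```

The point of the sketch is only the ordering: allocate with "create", initialize, then synchronize explicitly with an update directive before the first kernel that reads the data.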