I am working on a code that includes a loop with many iterations (~10^6 to 10^7), in which an array (say, 'myresult') is accumulated by summing a large number of contributions. In Fortran 90 with OpenMP, it looks something like this:
!$omp parallel do
!$omp& reduction(+:myresult)
do i=1,N
myresult(i) = myresult(i) + [contribution]
enddo
!$omp end parallel do
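To make the pattern concrete, here is a minimal self-contained sketch of an array reduction. It assumes that every iteration adds something to every element of 'myresult'. The array length m, the iteration count n, and the contribution itself (simply the element index k) are placeholders chosen for illustration, not the real calculation.

```fortran
program array_reduction
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000   ! number of loop iterations (assumed)
  integer, parameter :: m = 4         ! length of myresult (assumed)
  real(dp) :: myresult(m)
  integer :: i, k

  myresult = 0.0_dp

  ! Each thread accumulates into its own private copy of myresult;
  ! the copies are summed into the shared array when the loop ends.
  !$omp parallel do private(k) reduction(+:myresult)
  do i = 1, n
     do k = 1, m
        ! hypothetical contribution: iteration i adds k to element k
        myresult(k) = myresult(k) + real(k, dp)
     end do
  end do
  !$omp end parallel do

  ! With this placeholder contribution, element k should equal n*k.
  print *, myresult
end program array_reduction
```

Compiled with OpenMP enabled (for example -fopenmp with gfortran or -qopenmp with ifort), the result should not depend on the number of threads.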
The code will run on a system with Intel Xeon Phi coprocessors, and I would of course like to take advantage of them if possible. I have tried using MIC offload directives (!dir$ offload target ...) with OpenMP so that the loop runs on the coprocessor alone, but then host CPU time is wasted while the host sits waiting for the coprocessor to finish. Ideally, the loop would be divided between the host and the device, so I would like to know whether something like the following is feasible (or whether there is a better approach). The host part of the loop would run on only one core (though perhaps with OMP_NUM_THREADS=2?):
!$omp parallel sections
!$omp& reduction(+:myresult)
!$omp section ! parallel calculation on device
!dir$ omp offload target(mic)
!$omp parallel do
!$omp& reduction(+:myresult)
(do i=N/2+1,N)
!$omp end parallel do
!$omp section ! serial calculation on host
(do i=1,N/2)
!$omp end parallel sections
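For reference, the splitting idea can be tried on the host alone with the sketch below. It is not the offload version: the offload directive is deliberately left out so that it builds with any OpenMP Fortran compiler, and a comment marks where it would go. Instead of a reduction clause on the sections construct, it uses two separate partial arrays that are added after the sections finish. The names part_dev, part_host and contribution, and the sizes n and m, are hypothetical.

```fortran
program split_reduction
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000   ! number of loop iterations (assumed)
  integer, parameter :: m = 4         ! length of myresult (assumed)
  real(dp) :: myresult(m), part_dev(m), part_host(m)
  integer :: i

  part_dev  = 0.0_dp
  part_host = 0.0_dp

  !$omp parallel sections private(i)
  !$omp section
  ! Upper half of the range. In the real code the offload directive
  ! would precede this parallel loop; it is omitted in this sketch.
  !$omp parallel do reduction(+:part_dev)
  do i = n/2 + 1, n
     part_dev = part_dev + contribution(i)
  end do
  !$omp end parallel do
  !$omp section
  ! Lower half of the range, accumulated serially by one host thread.
  do i = 1, n/2
     part_host = part_host + contribution(i)
  end do
  !$omp end parallel sections

  ! Combine the two partial sums once both sections have finished.
  myresult = part_dev + part_host
  print *, myresult

contains

  ! Hypothetical contribution: every iteration adds 1 to every element.
  pure function contribution(i) result(c)
    integer, intent(in) :: i
    real(dp) :: c(m)
    c = 1.0_dp
  end function contribution

end program split_reduction
```

With this placeholder contribution every element of 'myresult' should equal n. The inner parallel loop uses more than one thread only if nested parallelism is enabled; otherwise it runs on a single thread and still gives the same result.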