4
votes

I've quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.

The problem seems to be the reductions. Using OpenMP reductions work and give the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether or not the order of the additions make any difference, but they should not produce errors this big.

Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.

So my question is: why doesn't a critical section work in this case?

My code is below. All shared variables except np, tm, hm, gam are read-only.

EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the of the loops; if they are "visited", generate new ones) and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!

TL;DR: The discrepancies in the results were caused by the ordering of the floating point values. Using double precision instead helps.

!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
  DO i=1,nd        
!$OMP CRITICAL (main)
! Second loop over the data:
    DO j=1,nd

! The lag:
      zdis = z(j) - z(i)
      IF(zdis >= 0.0) THEN
        izl =  INT( zdis/dzlag+0.5)
      ELSE
        izl = -INT(-zdis/dzlag+0.5)
      END IF
 ! ---- SNIP ----

! Loop over all variograms for this lag:

      DO cur_variogram=1,nvarg
        variogram_type = ivtype(cur_variogram)

! Get the head and tail values:

        indx = i+(ivhead(cur_variogram)-1)*maxdim
        vrh   = vr(indx)
        indx = j+(ivtail(cur_variogram)-1)*maxdim
        vrt   = vr(indx)
        IF(vrh < tmin.OR.vrh >= tmax.OR. vrt < tmin.OR.vrt >= tmax) CYCLE

        ! ----- PROBLEM AREA -------
        np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
        tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt  
        hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
        gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
        ! ----- END OF PROBLEM AREA -----

        !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
    END DO
    !$OMP END CRITICAL (main)
  END DO
!$OMP END DO
!$OMP END PARALLEL

Thanks very much in advance!

2
I would suggest trying to post a SSCCE instead of an incomplete snippet. This will force you to analyze better what actually causes the problem and will give other people the possibility to reproduce it in their own environment.Massimiliano
some of the results are off by a thousandth or so Is that a measure of absolute error or of relative error ? If the former, what is the magnitude of the latter ?High Performance Mark
@Massimiliano Normally I would agree with you, however, it is not possible to provide a self-contained example in this case due to initialization and data in general.Painless
@HighPerformanceMark The error I reported was absolute. The relative error is very small (getting results such as 84.26539 vs 84.26538). However, the absolute error still seems way too big and I can't dismiss it. I also tested the ordering of the operations by randomly walking through the number range of the do loops and it produced exact results. I also doubt there's any SIMD magic behind this.Painless
Use real kind with larger precision and see what it does.Vladimir F

2 Answers

3
votes

If you are using 32-bit floating-point numbers and arithmetic the difference between 84.26539 and 84.26538, that is a difference of 1 in the least-significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit f-p number only has about 7 decimal digits to play with.

Ordinary floating-point arithmetic is not strictly associative. For real (in the mathematical not Fortran sense) numbers (a+b)+c==a+(b+c) but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.

The non-determinism arises because, in using OpenMP you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce the same-to-the-last-bit results.

I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.

2
votes

High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).

float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
printf("sum %f\n", a+b+c);   //output 1.234560
printf("sum %f\n", a+(b+c)); //output 0.000000

But I would like to point out it is possible to preserve order in OpenMP. I discussed this here C++ OpenMP: Split for loop in even chunks static and join data at the end

Edit:

Actually, I confused commutativity and associativity. If you have an operator which is associative but not commuative than it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT asssociative so the only thing that came be done is to break IEEE and let it be associative.