20
votes

I would like to compute the sum, rounded up, of two IEEE 754 binary64 numbers. To that end I wrote the C99 program below:

#include <stdio.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

int main(int c, char *v[]){
  fesetround(FE_UPWARD);
  printf("%a\n", 0x1.0p0 + 0x1.0p-80);
}

However, if I compile and run my program with various compilers:

$ gcc -v
…
gcc version 4.2.1 (Apple Inc. build 5664)
$ gcc -Wall -std=c99 add.c && ./a.out 
add.c:3: warning: ignoring #pragma STDC FENV_ACCESS
0x1p+0
$ clang -v
Apple clang version 1.5 (tags/Apple/clang-60)
Target: x86_64-apple-darwin10
Thread model: posix
$ clang -Wall -std=c99 add.c && ./a.out 
add.c:3:14: warning: pragma STDC FENV_ACCESS ON is not supported, ignoring
      pragma [-Wunknown-pragmas]
#pragma STDC FENV_ACCESS ON
             ^
1 warning generated.
0x1p+0

It doesn't work! (I expected the result 0x1.0000000000001p0).

Indeed, the computation was done at compile-time in the default round-to-nearest mode:

$ clang -Wall -std=c99 -S add.c && cat add.s
add.c:3:14: warning: pragma STDC FENV_ACCESS ON is not supported, ignoring
      pragma [-Wunknown-pragmas]
#pragma STDC FENV_ACCESS ON
             ^
1 warning generated.
…
LCPI1_0:
    .quad   4607182418800017408
…
    callq   _fesetround
    movb    $1, %cl
    movsd   LCPI1_0(%rip), %xmm0
    leaq    L_.str(%rip), %rdx
    movq    %rdx, %rdi
    movb    %cl, %al
    callq   _printf
…
L_.str:
    .asciz   "%a\n"

Yes, I did see the warning emitted by each compiler. I understand that turning the applicable optimizations on or off at the scale of the line may be tricky. I would still like, if that was at all possible, to turn them off at the scale of the file, which would be enough to resolve my question.

My question is: what command-line option(s) should I use with GCC or Clang so as to compile a C99 compilation unit that contains code intended to be executed with an FPU rounding mode other than the default?

Digression

While researching this question, I found this GCC C99 compliance page, containing the entry below, that I will just leave here in case someone else finds it funny. Grrrr.

floating-point      |     |
environment access  | N/A | Library feature, no compiler support required.
in <fenv.h>         |     |
2
That's just so bad :(Kuba hasn't forgotten Monica
As far as clang is concerned, you're hitting llvm.org/bugs/show_bug.cgi?id=8100Stephen Canon
@StephenCanon This excellent bug report mentions the solution for GCC, -frounding-math.Pascal Cuoq
... which only sorta works, but at least it sorta works?Stephen Canon

2 Answers

1
votes

clang or gcc -frounding-math tells them that code might run with a non-default rounding mode. It's not fully safe (it assumes the same rounding mode is active the whole time), but better than nothing. You might still need to use volatile to avoid CSE in some cases, or maybe the noinline wrapper trick from the other answer which in practice may work even better if you limit it to a single operation.


As you noticed, GCC doesn't support #pragma STDC FENV_ACCESS ON. The default behaviour is like FENV_ACCESS OFF. Instead, you have to use command line options (or maybe per-function attributes) to control FP optimizations.

As described in https://gcc.gnu.org/wiki/FloatingPointMath, -frounding-math is not on by default, so GCC assumes the default rounding mode when doing constant propagation and other optimizations at compile-time.

But with gcc -O3 -frounding-math, constant propagation is blocked. Even if you don't call fesetround; what's actually happening is that GCC makes asm that's safe if the rounding mode had already been set to something else before main was even called.

But unfortunately, as the wiki notes, GCC still assumes that the same rounding mode is in effect everywhere (GCC bug #34678). That means it will CSE two calculations of the same inputs before/after a call to fesetround, because it doesn't treat fesetround as special.

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

void foo(double *restrict out){
    out[0] = 0x1.0p0 + 0x1.0p-80;
    fesetround(FE_UPWARD);
    out[1] = 0x1.0p0 + 0x1.0p-80;
}

compiles as follows (Godbolt) with gcc10.2 (and essentially the same with clang10.1). Also includes your main, which does make the asm you want.

foo:
        push    rbx
        mov     rbx, rdi
        sub     rsp, 16
        movsd   xmm0, QWORD PTR .LC1[rip]
        addsd   xmm0, QWORD PTR .LC0[rip]     # runtime add
        movsd   QWORD PTR [rdi], xmm0         # store out[0]
        mov     edi, 2048
        movsd   QWORD PTR [rsp+8], xmm0       # save a local temporary for later
        call    fesetround
        movsd   xmm0, QWORD PTR [rsp+8]
        movsd   QWORD PTR [rbx+8], xmm0       # store the same value, not recalc
        add     rsp, 16
        pop     rbx
        ret

This is the same problem @Marc Glisse warned about in comments under the other answer in case your noinline function did the same math before and after changing the rounding mode.

(And also that it's partly luck that GCC chose not to do the math before calling fesetround the first time, so it would only have to spill the result instead of both inputs. x86-64 System V doesn't have any call-preserved XMM regs.)

4
votes

I couldn't find any command line options that would do what you wanted. However, I did find a way to rewrite your code so that even with maximum optimizations (even architectural optimizations), neither GCC nor Clang compute the value at compile time. Instead, this forces them to output code that will compute the value at runtime.

C:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

// add with rounding up
double __attribute__ ((noinline)) addrup (double x, double y) {
  int round = fegetround ();
  fesetround (FE_UPWARD);
  double r = x + y;
  fesetround (round);   // restore old rounding mode
  return r;
}

int main(int c, char *v[]){
  printf("%a\n", addrup (0x1.0p0, 0x1.0p-80));
}

This results in these outputs from GCC and Clang, even when using maximum and architectural optimizations:

gcc -S -x c -march=corei7 -O3 (Godbolt GCC):

addrup:
        push    rbx
        sub     rsp, 16
        movsd   QWORD PTR [rsp+8], xmm0
        movsd   QWORD PTR [rsp], xmm1
        call    fegetround
        mov     edi, 2048
        mov     ebx, eax
        call    fesetround
        movsd   xmm1, QWORD PTR [rsp]
        mov     edi, ebx
        movsd   xmm0, QWORD PTR [rsp+8]
        addsd   xmm0, xmm1
        movsd   QWORD PTR [rsp], xmm0
        call    fesetround
        movsd   xmm0, QWORD PTR [rsp]
        add     rsp, 16
        pop     rbx
        ret
.LC2:
        .string "%a\n"
main:
        sub     rsp, 8
        movsd   xmm1, QWORD PTR .LC0[rip]
        movsd   xmm0, QWORD PTR .LC1[rip]
        call    addrup
        mov     edi, OFFSET FLAT:.LC2
        mov     eax, 1
        call    printf
        xor     eax, eax
        add     rsp, 8
        ret
.LC0:
        .long   0
        .long   988807168
.LC1:
        .long   0
        .long   1072693248

clang -S -x c -march=corei7 -O3 (Godbolt GCC):

addrup:                                 # @addrup
        push    rbx
        sub     rsp, 16
        movsd   qword ptr [rsp], xmm1   # 8-byte Spill
        movsd   qword ptr [rsp + 8], xmm0 # 8-byte Spill
        call    fegetround
        mov     ebx, eax
        mov     edi, 2048
        call    fesetround
        movsd   xmm0, qword ptr [rsp + 8] # 8-byte Reload
        addsd   xmm0, qword ptr [rsp]   # 8-byte Folded Reload
        movsd   qword ptr [rsp + 8], xmm0 # 8-byte Spill
        mov     edi, ebx
        call    fesetround
        movsd   xmm0, qword ptr [rsp + 8] # 8-byte Reload
        add     rsp, 16
        pop     rbx
        ret

.LCPI1_0:
        .quad   4607182418800017408     # double 1
.LCPI1_1:
        .quad   4246894448610377728     # double 8.2718061255302767E-25
main:                                   # @main
        push    rax
        movsd   xmm0, qword ptr [rip + .LCPI1_0] # xmm0 = mem[0],zero
        movsd   xmm1, qword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero
        call    addrup
        mov     edi, .L.str
        mov     al, 1
        call    printf
        xor     eax, eax
        pop     rcx
        ret

.L.str:
        .asciz  "%a\n"

Now for the more interesting part: why does that work?

Well, when they (GCC and/or Clang) compile code, they try to find and replace values that can be computed at runtime. This is known as constant propagation. If you had simply written another function, constant propagation would cease to occur, since it isn't supposed to cross functions.

However, if they see a function that they could, in theory, substitute the code of in place of the function call, they may do so. This is known as function inlining. If function inlining will work on a function, we say that that function is (surprise) inlinable.

If a function always return the same results for a given set of inputs, then it is considered pure. We also say that it has no side effects (meaning it makes no changes to the environment).

Now, if a function is fully inlinable (meaning that it doesn't make any calls to external libraries excluding a few defaults included in GCC and Clang - libc, libm, etc.) and is pure, then they will apply constant propagation to the function.

In other words, if we don't want them to propagate constants through a function call, we can do one of two things:

  • Make the function appear impure:
    • Use the filesystem
    • Do some bullshit magic with some random input from somewhere
    • Use the network
    • Use some syscall of some sort
    • Call something from an external library unknown to GCC and/or Clang
  • Make the function not fully inlinable
    • Call something from an external library unknown to GCC and/or Clang
    • Use __attribute__ ((noinline))

Now, that last one is the easiest. As you may have surmised, __attribute__ ((noinline)) marks the function as non-inlinable. Since we can take advantage of this, all we have to do is make another function that does whatever computation we want, mark it with __attribute__ ((noinline)), and then call it.

When it is compiled, they will not violate the inlining and, by extension, constant propagation rules, and therefore, the value will be computed at runtime with the appropriate rounding mode set.