0
votes

I have made a function g that is able to approximate a function to a certain degree, this function gives accurate results up to 5 decimals ( 1,23456xxxxxxxxxxxx where the x positions are just rounding errors / junk ) .

To avoid spreading error to other computations that will use the results of g I would like to just set all the x positions to zero, better yet, just set to 0 everything after the 5th decimal .

I haven't found anything in X87 and SSE literature that let's me play with IEEE 754 bits or their representation the way I would like to .

There is an old reference to the FISTP instruction for X87 that is apparently mirrored in the SSE world with FISTTP, with the benefit that FISTTP doesn't necesserily modify the control word and is therefore faster .

I have noticed that FISTTP was called "chopping mode", but now in more modern literature is just "rounding toward zero" or "truncate" and this confuse me because "chopping" means removing something altogether where "rounding toward zero" doesn't necessarily means the same thing to me .

I don't need to round to zero, I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory; how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?

2
Wait... I'm confused. You mention IEEE and bits which means you know about FP representation. But at the same time you're asking how to truncate decimals which is impossible in binary floating-point. Am I missing something?Mysticial
@user354900 I disagree, work to the best precision you can, and truncate or round for display. By truncating during the calculations, you will "spread" a bigger error, not less.Weather Vane
@user354900 "Chopping mode" in IEEE floating point is just truncating the binary representation. In no way does it map to decimal digits unless you happen to be rounding to an integer.Mysticial
Floating point values (typically) do not have "decimal values" - that's a conversion to human readable format. If you don't need them, then you are not doing any further calculations with the values. So you might just as well retain the precision and deal with truncation at output. If you are doing more calculations, then you do need maximum precison.Weather Vane
@user354900 If you're really asking that question. Then I presume you don't know floating-point math. So I refer you to this: docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.htmlMysticial

2 Answers

3
votes

As several people commented, more early rounding doesn't help the final result be more accurate. If you want to read more about floating point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the one with the index

We’ve finally reached the point in this series that I’ve been waiting for. In this post I am going to share the most crucial piece of floating-point math knowledge that I have. Here it is:

[Floating-point] math is hard.

You just won’t believe how vastly, hugely, mind-bogglingly hard it is. I mean, you may think it’s difficult to calculate when trains from Chicago and Los Angeles will collide, but that’s just peanuts to floating-point math.

(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)


How you could actually implement your bad idea:

There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.

Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.

So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down. (like chux's round()-based function, but I'd recommend C99 double rint(double) instead of round(). round has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code.


The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.

FISTP (Float Integer STore (and Pop the x87 stack) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (default: round to nearest), which is the same semantics as those ISO C standard functions.

FISTTP (Float Integer STore with Truncation (and Pop the x87 stack) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.

This is what the "not modify the control word" stuff is about. Neither instruction does that, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.

Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.


PS. if you didn't recognize Bruce's HHGTG reference, the original is:

Space is big. Really big. You just won’t believe how vastly hugely mindbogglingly big it is. I mean you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.

2
votes

how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?

The following does not use X87 nor SSE. I've included it as a community reference for general purpose code. If anything, it can be used to test a X87 solution.

  1. Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerable as OP said "To avoid spreading error to other computations ..."

  2. It is unclear if OP wants "accurate results up to 5 decimals" to reflect absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). Will assume "absolute precision".

  3. Since float, double are far often a binary floating point, any "chop" will reflect a FP number nearest to a multiple of 0.00001.

  4. Text Method:

    //       - x    xxx...xxx    . xxxxx \0
    char buf[1+1+ DBL_MAX_10_EXP+1  +5   +1];
    sprintf(buf, "%.5f", x);
    x = atof(buf);
    
  5. round() rint() method:

    #define SCALE 100000.0
    if (fabs(x) < DBL_MAX/SCALE) {
      x = x*SCALE;
      x = rint(x)/SCALE;
    }
    
  6. Direct bit manipulation of x. Simply zero select bits in the significand.

    TBD code.