Here's what's going on in binary. As we know, some floating-point values cannot be represented exactly in binary, even if they can be represented exactly in decimal. These 3 numbers are just examples of that fact.
With this program I output the hexadecimal representations of each number and the results of each addition.
public class Main {
    public static void main(String[] args) {
        double x = 23.53;   // Inexact representation
        double y = 5.88;    // Inexact representation
        double z = 17.64;   // Inexact representation
        double s = 47.05;   // What math tells us the sum should be; still inexact

        printValueAndInHex(x);
        printValueAndInHex(y);
        printValueAndInHex(z);
        printValueAndInHex(s);

        System.out.println("--------");

        // First order: (x + y) + z
        double t1 = x + y;
        printValueAndInHex(t1);
        t1 = t1 + z;
        printValueAndInHex(t1);

        System.out.println("--------");

        // Second order: (x + z) + y
        double t2 = x + z;
        printValueAndInHex(t2);
        t2 = t2 + y;
        printValueAndInHex(t2);
    }

    // Prints the raw 64-bit pattern in hex, followed by the decimal value.
    private static void printValueAndInHex(double d) {
        System.out.println(Long.toHexString(Double.doubleToLongBits(d)) + ": " + d);
    }
}
The printValueAndInHex method is just a hex-printer helper.
The output is as follows:
403787ae147ae148: 23.53
4017851eb851eb85: 5.88
4031a3d70a3d70a4: 17.64
4047866666666666: 47.05
--------
403d68f5c28f5c29: 29.41
4047866666666666: 47.05
--------
404495c28f5c28f6: 41.17
4047866666666667: 47.050000000000004
The first 4 numbers are the hexadecimal representations of x, y, z, and s. In the IEEE 754 double-precision format, bits 2-12 hold the binary exponent, that is, the scale of the number. (The first bit is the sign bit, and the remaining 52 bits hold the mantissa.) The stored exponent is biased: the actual exponent is the stored binary number minus 1023.
The exponents for the first 4 numbers are extracted:
sign|exponent
403 => 0|100 0000 0011| => 1027 - 1023 = 4
401 => 0|100 0000 0001| => 1025 - 1023 = 2
403 => 0|100 0000 0011| => 1027 - 1023 = 4
404 => 0|100 0000 0100| => 1028 - 1023 = 5
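The same fields can be pulled out in code instead of by reading hex digits. This is only a minimal sketch: the class name DoubleFields and its helper methods are made up for illustration, and the only library calls it relies on are Double.doubleToLongBits and Long.toHexString.

```java
// Hypothetical helper class, not part of the original program.
class DoubleFields {
    /** Sign bit: 0 for positive, 1 for negative. */
    static int sign(double d) {
        return (int) (Double.doubleToLongBits(d) >>> 63);
    }

    /** The 11 stored exponent bits, still carrying the 1023 bias. */
    static int biasedExponent(double d) {
        return (int) ((Double.doubleToLongBits(d) >>> 52) & 0x7FFL);
    }

    /** The 52 stored mantissa bits (without the implicit leading 1). */
    static long mantissa(double d) {
        return Double.doubleToLongBits(d) & 0x000FFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double[] values = {23.53, 5.88, 17.64, 47.05};
        for (double d : values) {
            int biased = biasedExponent(d);
            System.out.println(d + ": sign=" + sign(d)
                    + " stored exponent=" + biased
                    + " actual exponent=" + (biased - 1023)
                    + " mantissa=" + Long.toHexString(mantissa(d)));
        }
    }
}
```

For normal (non-zero, finite) values, Math.getExponent(d) returns the already unbiased exponent directly, so the subtraction of 1023 is only needed when working from the raw bits.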
First set of additions
The second number (y) is of smaller magnitude: its exponent is 2 lower than that of x. When adding these two numbers to get x + y, the mantissa of y is shifted right by 2 bits to line up with x, so the last 2 bits of y (01) are shifted out of range and do not figure into the calculation.
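To see which bits fall off, the shift can be reproduced with integer arithmetic on the raw mantissa. This is a sketch under stated assumptions: AlignmentDemo, significand, alignmentShift, and droppedBits are illustrative names, and the code only handles positive normal doubles where the first argument has the larger (or equal) exponent.

```java
// Hypothetical helper class for illustration only.
class AlignmentDemo {
    /** 53-bit significand of a positive normal double, implicit leading 1 included. */
    static long significand(double d) {
        return (Double.doubleToLongBits(d) & 0x000FFFFFFFFFFFFFL) | (1L << 52);
    }

    /** How many places the smaller operand shifts right to line up with the larger one. */
    static int alignmentShift(double big, double small) {
        return Math.getExponent(big) - Math.getExponent(small);
    }

    /** The low bits of the smaller operand that fall off the end after that shift. */
    static long droppedBits(double big, double small) {
        int shift = alignmentShift(big, small);
        return significand(small) & ((1L << shift) - 1);
    }

    public static void main(String[] args) {
        double x = 23.53, y = 5.88;
        System.out.println("shift   = " + alignmentShift(x, y));
        // Printed in binary; Long.toBinaryString omits leading zeros.
        System.out.println("dropped = " + Long.toBinaryString(droppedBits(x, y)));
    }
}
```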
The second addition adds x + y and z, which are two numbers of the same scale.
Second set of additions
Here, x + z occurs first. They are of the same scale, but they yield a number that is higher up in scale:
404 => 0|100 0000 0100| => 1028 - 1023 = 5
The second addition adds x + z and y, and now 3 bits (101) are dropped from y to line the numbers up. This time the result must be rounded upwards, because it comes out as the next floating-point number up: 4047866666666666 for the first set of additions vs. 4047866666666667 for the second set of additions. That error is significant enough to show in the printout of the total.
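The align, add, and round steps described above can be rebuilt by hand to confirm where the extra bit comes from. The sketch below is not how the JVM or the hardware is implemented; ManualAdd and addBits are hypothetical names, and the method assumes two positive normal doubles whose exponents differ by 1 to 52 and whose sum stays at the larger operand's exponent.

```java
// Hypothetical illustration of align, add, round-to-nearest-even.
class ManualAdd {
    static final long FRACTION_MASK = 0x000FFFFFFFFFFFFFL;

    /** 53-bit significand with the implicit leading 1 restored. */
    static long significand(double d) {
        return (Double.doubleToLongBits(d) & FRACTION_MASK) | (1L << 52);
    }

    /**
     * Bit pattern of big + small, rebuilt with integer arithmetic.
     * Assumption: positive normal inputs, big's exponent is larger by 1..52,
     * and the sum does not carry into the next power of two.
     */
    static long addBits(double big, double small) {
        int shift = Math.getExponent(big) - Math.getExponent(small);
        long aligned = significand(small) >>> shift;              // bits that survive
        long dropped = significand(small) & ((1L << shift) - 1);  // bits that fall off
        long half = 1L << (shift - 1);                            // half a unit in the last place
        long sum = significand(big) + aligned;
        // Round to nearest, ties to even.
        if (dropped > half || (dropped == half && (sum & 1L) == 1L)) {
            sum++;
        }
        long exponentField = (long) (Math.getExponent(big) + 1023) << 52;
        return exponentField | (sum & FRACTION_MASK);
    }

    public static void main(String[] args) {
        double x = 23.53, y = 5.88, z = 17.64;
        System.out.println(Long.toHexString(addBits(x, y)));      // dropped bits 01: rounds down
        System.out.println(Long.toHexString(addBits(x + z, y)));  // dropped bits 101: rounds up
    }
}
```

In the first call the dropped bits (01) are below the halfway point, so nothing is added. In the second call the dropped bits (101) are above it, so the sum is bumped by one in the last place, which is the ...667 pattern seen in the output.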
In conclusion, be careful when performing mathematical operations on IEEE numbers. Some representations are inexact, and they become even more inexact when the scales are different. Add and subtract numbers of similar scale if you can.
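How far apart the two orderings land can be measured directly. This is a small sketch; OrderOfAddition and ulpsApart are illustrative names, and ulpsApart assumes both arguments are finite and have the same sign.

```java
// Hypothetical check of how far apart the two orderings end up.
class OrderOfAddition {
    /** How many representable steps separate a and b (same sign, finite). */
    static long ulpsApart(double a, double b) {
        return Math.abs(Double.doubleToLongBits(a) - Double.doubleToLongBits(b));
    }

    public static void main(String[] args) {
        double x = 23.53, y = 5.88, z = 17.64;
        double first = (x + y) + z;   // similar scales added last
        double second = (x + z) + y;  // small value added to a larger partial sum
        System.out.println(first == second);
        System.out.println(ulpsApart(first, second));
        System.out.println(second - first == Math.ulp(first));
    }
}
```

The two results are adjacent doubles, so the difference is exactly one unit in the last place (Math.ulp) of the total.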
(2.0^53 + 1) - 1 == 2.0^53 - 1 != 2^53 == 2^53 + (1 - 1)). Hence, yes: be wary when choosing the order of sums and other operations. Some languages provide a built-in to perform "high-precision" sums (e.g. Python's math.fsum), so you might consider using these functions instead of the naive sum algorithm. – Bakuriu
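The 2^53 example from the comment can be reproduced in Java, along with one way to make a sum less sensitive to ordering. Note that the technique below is Neumaier's variant of Kahan compensated summation, swapped in for illustration; it is not the algorithm behind math.fsum, and CompensatedSum, naiveSum, and neumaierSum are hypothetical names.

```java
// Hypothetical sketch: naive summation vs. Neumaier compensated summation.
class CompensatedSum {
    /** Plain left-to-right summation. */
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    /** Neumaier summation: accumulates the rounding error of each addition separately. */
    static double neumaierSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0;
        for (double v : values) {
            double t = sum + v;
            if (Math.abs(sum) >= Math.abs(v)) {
                compensation += (sum - t) + v;  // low-order bits of v were lost
            } else {
                compensation += (v - t) + sum;  // low-order bits of sum were lost
            }
            sum = t;
        }
        return sum + compensation;
    }

    public static void main(String[] args) {
        double p = Math.pow(2, 53);
        double[] tricky = {p, 1, -1};
        System.out.println(naiveSum(tricky) == p - 1);   // the added 1 is lost, then 1 is subtracted
        System.out.println(neumaierSum(tricky) == p);    // the lost 1 is recovered

        double x = 23.53, y = 5.88, z = 17.64;
        System.out.println(neumaierSum(new double[] {x, y, z})
                == neumaierSum(new double[] {x, z, y}));
    }
}
```

With this sketch both orderings of x, y, and z produce the same double. That double is the correctly rounded sum of the three stored (inexact) inputs, so it can still differ from the literal 47.05.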