39
votes

I keep getting mixed answers on whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision value which can vary.

One topic called float vs. double precision seems to imply that floating point precision is an absolute.

However, another topic called Difference between float and double says,

In general a double has 15 to 16 decimal digits of precision

Another source says,

Variables of type float typically have a precision of about 7 significant digits

Variables of type double typically have a precision of about 16 significant digits

I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?

10
It is stored as binary internally, so decimal precision is not accurate. - n0rd
If you don't like approximations, use fixed-point math instead. - Michael Dorgan
The about is due to the conversion from significant bits to significant digits. - Degustaf
There's a nice series on floating point math on this blog. Due to the inexact conversion between binary and decimal representation, you're not going to really get a better answer than "about" so you might want to fully read up on the topic. - jaggedSpire
@MichaelDorgan: if you don't like approximations, you'll need to stick to integer math. Fixed-point (though somewhat more predictable than floating) is still just an approximation to the reals/rationals which are what you really want to express, in almost all interesting applications. And it's typically a worse approximation than floating-point! - leftaroundabout

10 Answers

29
votes

The precision is fixed, which is exactly 53 binary digits for double-precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits.


The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits.

To understand this intuitively, let's consider a less-precise floating-point format: instead of a 52-bit mantissa like double-precision numbers have, we're just going to use a 4-bit mantissa.

So, each number will look like: (-1)^s × 2^yyy × 1.xxxx (where s is the sign bit, yyy is the exponent, and 1.xxxx is the normalised mantissa). For the immediate discussion, we'll focus only on the mantissa and not the sign or exponent.

Here's a table of what 1.xxxx looks like for all xxxx values (all rounding is half-to-even, just like how the default floating-point rounding mode works):

  xxxx  |  1.xxxx  |  value   |  2dd  |  3dd  
--------+----------+----------+-------+--------
  0000  |  1.0000  |  1.0     |  1.0  |  1.00
  0001  |  1.0001  |  1.0625  |  1.1  |  1.06
  0010  |  1.0010  |  1.125   |  1.1  |  1.12
  0011  |  1.0011  |  1.1875  |  1.2  |  1.19
  0100  |  1.0100  |  1.25    |  1.2  |  1.25
  0101  |  1.0101  |  1.3125  |  1.3  |  1.31
  0110  |  1.0110  |  1.375   |  1.4  |  1.38
  0111  |  1.0111  |  1.4375  |  1.4  |  1.44
  1000  |  1.1000  |  1.5     |  1.5  |  1.50
  1001  |  1.1001  |  1.5625  |  1.6  |  1.56
  1010  |  1.1010  |  1.625   |  1.6  |  1.62
  1011  |  1.1011  |  1.6875  |  1.7  |  1.69
  1100  |  1.1100  |  1.75    |  1.8  |  1.75
  1101  |  1.1101  |  1.8125  |  1.8  |  1.81
  1110  |  1.1110  |  1.875   |  1.9  |  1.88
  1111  |  1.1111  |  1.9375  |  1.9  |  1.94

How many decimal digits would you say that provides? You could say 2, in that each value in the two-decimal-digit range is covered, albeit not uniquely; or you could say 3, which covers all unique values, but does not provide coverage for all values in the three-decimal-digit range.

For the sake of argument, we'll say it has 2 decimal digits: the decimal precision will be the number of digits where all values of those decimal digits could be represented.


Okay, then, so what happens if we halve all the numbers (so we're using yyy = -1)?

  xxxx  |  1.xxxx  |  value    |  1dd  |  2dd  
--------+----------+-----------+-------+--------
  0000  |  1.0000  |  0.5      |  0.5  |  0.50
  0001  |  1.0001  |  0.53125  |  0.5  |  0.53
  0010  |  1.0010  |  0.5625   |  0.6  |  0.56
  0011  |  1.0011  |  0.59375  |  0.6  |  0.59
  0100  |  1.0100  |  0.625    |  0.6  |  0.62
  0101  |  1.0101  |  0.65625  |  0.7  |  0.66
  0110  |  1.0110  |  0.6875   |  0.7  |  0.69
  0111  |  1.0111  |  0.71875  |  0.7  |  0.72
  1000  |  1.1000  |  0.75     |  0.8  |  0.75
  1001  |  1.1001  |  0.78125  |  0.8  |  0.78
  1010  |  1.1010  |  0.8125   |  0.8  |  0.81
  1011  |  1.1011  |  0.84375  |  0.8  |  0.84
  1100  |  1.1100  |  0.875    |  0.9  |  0.88
  1101  |  1.1101  |  0.90625  |  0.9  |  0.91
  1110  |  1.1110  |  0.9375   |  0.9  |  0.94
  1111  |  1.1111  |  0.96875  |  1.   |  0.97

By the same criterion as before, we're now dealing with 1 decimal digit. So you can see how, depending on the exponent, you can have more or fewer decimal digits, because binary and decimal floating-point numbers do not map cleanly to each other.
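The coverage criterion used for the two tables can be checked mechanically. The following is only a sketch of that idea: the function name `covers_every_decimal` and its parameters are made up for this illustration, and it counts digits after the decimal point rather than significant digits (so one place is two significant digits for 1.x and one significant digit for 0.x).

```cpp
#include <cassert>
#include <cmath>
#include <set>

// Toy format from the tables: value = 2^exponent * (1 + m/16), m = 0..15.
// Returns true if every decimal number with `places` digits after the point
// that lies in [2^exponent, 2^(exponent+1)) is the rounding of some toy value.
// Hypothetical helper, not part of any library.
bool covers_every_decimal(int exponent, int places) {
    assert(places >= 0);
    double scale = 1.0;
    for (int i = 0; i < places; ++i) scale *= 10.0;

    const double lo = std::ldexp(1.0, exponent);      // 2^exponent
    const double hi = std::ldexp(1.0, exponent + 1);  // 2^(exponent+1)

    // Round each of the 16 toy values to `places` decimal places.
    // std::lrint follows the current rounding mode, which is
    // round-half-to-even by default, matching the tables.
    std::set<long> reached;
    for (int m = 0; m < 16; ++m) {
        const double value = lo * (1.0 + m / 16.0);  // exact for these small inputs
        reached.insert(std::lrint(value * scale));
    }

    // Every scaled decimal k with lo <= k/scale < hi must have been reached.
    const long first = static_cast<long>(std::ceil(lo * scale));
    const long last = static_cast<long>(std::ceil(hi * scale)) - 1;
    for (long k = first; k <= last; ++k) {
        if (reached.count(k) == 0) return false;
    }
    return true;
}
```

With this sketch, `covers_every_decimal(0, 1)` and `covers_every_decimal(-1, 1)` both hold, while asking for one more place fails in either range, which corresponds to the 2-digit and 1-digit conclusions drawn from the tables.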

The same argument applies to double-precision floating point numbers (with the 52-bit mantissa), only in that case you're getting either 15 or 16 decimal digits depending on the exponent.

25
votes

All modern computers use binary floating-point arithmetic. That means we have a binary mantissa, which has typically 24 bits for single precision, 53 bits for double precision and 64 bits for extended precision. (Extended precision is available on x86 processors, but not on ARM or possibly other types of processors.)

24-, 53-, and 64-bit mantissas mean that for a floating-point number between 2^k and 2^(k+1), the gap to the next larger number is 2^(k-23), 2^(k-52), and 2^(k-63) respectively. That's the resolution. The rounding error of each floating-point operation is at most half of that.

So how does that translate into decimal numbers? It depends.

Take k = 0 and 1 ≤ x < 2. The resolution is 2^-23, 2^-52, and 2^-63, which is about 1.19×10^-7, 2.2×10^-16, and 1.08×10^-19 respectively. That's a bit less than 7, 16, and 19 decimals. Then take k = 3 and 8 ≤ x < 16. The difference between two floating-point numbers is now 8 times larger. For 8 ≤ x < 10 you get just over 6, less than 15, and just over 18 decimals respectively. But for 10 ≤ x < 16 you get one decimal more!
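The resolution described here can be observed directly with `std::nextafter`. This is a minimal sketch, assuming float and double are the usual IEEE-754 binary32 and binary64 types; `gap_above` is a name invented for the example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next larger representable value of the same type.
// Hypothetical helper for illustration.
template <typename T>
T gap_above(T x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<T>::infinity()) - x;
}
```

Under that assumption, `gap_above(1.0)` is 2^-52 and `gap_above(8.0)` is eight times that, while every value in 8 ≤ x < 16 shares the same gap.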

You get the highest number of decimal digits if x is only a bit less than 2^(k+1) and only a bit more than 10^n, for example 1000 ≤ x < 1024. You get the lowest number of decimal digits if x is just a bit higher than 2^k and a bit less than 10^n, for example 1/1024 ≤ x < 1/1000. The same binary precision can produce decimal precision that varies by up to 1.3 digits, or log10(2×10).

Of course, you could just read the article "What every computer scientist should know about floating-point arithmetic."

9
votes

80x86 code using its hardware coprocessor (originally the 8087) provides three levels of precision: 32-bit, 64-bit, and 80-bit. Those very closely follow the IEEE-754 standard of 1985. The recent standard specifies a 128-bit format. The floating point formats have 24, 53, 64, and 113 mantissa bits which correspond to 7.22, 15.95, 19.27, and 34.02 decimal digits of precision.

The formula is mantissa_bits / log_2 10 where the log base two of ten is 3.321928095.
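As a quick check of that formula, here is a small sketch; the function name `decimal_digits` is made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Decimal digits corresponding to a given number of mantissa bits,
// computed as mantissa_bits / log2(10). Hypothetical helper.
double decimal_digits(int mantissa_bits) {
    assert(mantissa_bits > 0);
    return mantissa_bits / std::log2(10.0);
}
```

Calling it with 24, 53, and 113 gives values that round to 7.22, 15.95, and 34.02.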

While the precision of any particular implementation does not vary, it may appear to when a floating point value is converted to decimal. Note that the value 0.1 does not have an exact binary representation. It is a repeating bit pattern (0.0001100110011001100110011001100...) like we are used to in decimal for 0.3333333333333 to approximate 1/3.
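One way to see that 0.1 is stored inexactly is to print more digits than the type can justify. This sketch assumes double is IEEE-754 binary64 and that the C library prints the exact decimal expansion of the stored value (glibc does; other libraries may pad with zeros). `to_fixed` is a helper invented for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats x with `places` digits after the decimal point.
// Hypothetical helper for illustration.
std::string to_fixed(double x, int places) {
    assert(places >= 0 && places < 60);
    char buffer[128];
    std::snprintf(buffer, sizeof buffer, "%.*f", places, x);
    return buffer;
}
```

Under those assumptions `to_fixed(0.1, 20)` gives `0.10000000000000000555`, whereas 0.5, which is a power of two, prints as exactly `0.50000000000000000000`.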

Many languages don't support the 80-bit format. Some C compilers may offer long double which uses either 80-bit floats or 128-bit floats. Alas, it might also use a 64-bit float, depending on the implementation.

The NPU has 80-bit registers and performs all operations using the full 80-bit result. Code which calculates within the NPU stack benefits from this extra precision. Unfortunately, poor code generation (or poorly written code) might truncate or round intermediate calculations by storing them in a 32-bit or 64-bit variable.

8
votes

Is floating point precision mutable or invariant, and why?

Typically, given any numbers in the same power-of-2 range, the floating point precision is invariant - a fixed value. The absolute precision changes with each power-of-2 step. Over the entire FP range, the precision is approximately relative to the magnitude. Relating this relative binary precision in terms of a decimal precision incurs a wobble varying between DBL_DIG and DBL_DECIMAL_DIG decimal digits - Typically 15 to 17.


What is precision? With FP, it makes most sense to discuss relative precision.

Floating point numbers have the form of:

Sign * Significand * pow(base,exponent)

They have a logarithmic distribution. There are about as many different floating point numbers between 100.0 and 3000.0 (a range of 30x) as there are between 2.0 and 60.0. This is true regardless of the underlying storage representation.
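The "about as many" claim can be checked by counting representable values. The following is a sketch that assumes double is IEEE-754 binary64, where the bit patterns of positive values sort in the same order as the values; `doubles_between` is a name made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Number of representable doubles in [lo, hi), for 0 < lo <= hi.
// Assumes IEEE-754 binary64; hypothetical helper for illustration.
std::uint64_t doubles_between(double lo, double hi) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "binary64 assumed");
    assert(lo > 0.0 && lo <= hi);
    std::uint64_t lo_bits, hi_bits;
    std::memcpy(&lo_bits, &lo, sizeof lo);  // copy the raw bit patterns
    std::memcpy(&hi_bits, &hi, sizeof hi);
    return hi_bits - lo_bits;
}
```

Under that assumption the counts for 100.0 to 3000.0 and for 2.0 to 60.0 agree to within about one percent, and there are exactly 2^52 values from 1.0 up to (but not including) 2.0.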

1.23456789e100 has about the same relative precision as 1.23456789e-100.


Most computers implement double as binary64. This format has 53 bits of binary precision.

The n numbers between 1.0 and 2.0 have the same absolute precision of 1 part in ((2.0-1.0)/pow(2,52)).
Numbers between 64.0 and 128.0, also n, have the same absolute precision of 1 part in ((128.0-64.0)/pow(2,52)).

Every group of numbers between consecutive powers of 2 has the same absolute precision.

Over the entire normal range of FP numbers, this approximates a uniform relative precision.

When these numbers are represented as decimal, the precision wobbles: Numbers 1.0 to 2.0 have 1 more bit of absolute precision than numbers 2.0 to 4.0. 2 more bits than 4.0 to 8.0, etc.

C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision.

Typically this means a given double will have 15 to 17 decimal digits of precision.
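The 15 versus 17 distinction shows up when a value is printed and read back. This sketch assumes double is binary64 (so DBL_DIG is 15 and DBL_DECIMAL_DIG is 17) and a correctly rounding C library; `round_trips` is a name invented for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// True if printing x with `digits` significant decimal digits and parsing
// the text back yields exactly the same double. Hypothetical helper.
bool round_trips(double x, int digits) {
    assert(digits > 0 && digits < 40);
    char buffer[64];
    // %e prints one digit before the point, so the precision is digits - 1.
    std::snprintf(buffer, sizeof buffer, "%.*e", digits - 1, x);
    return std::strtod(buffer, nullptr) == x;
}
```

For example, `round_trips(1.0 / 3.0, 17)` holds but `round_trips(1.0 / 3.0, 15)` does not, even though a short decimal such as 0.1 survives with 15 digits.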

Consider 1.0 and its next representable double: the digits do not change until the 17th significant decimal digit. Each next double is pow(2,-52) or about 2.2204e-16 apart.

/*
1 234567890123456789 */
1.000000000000000000...
1.000000000000000222...

Now consider "8.521812787393891" and its next representable number as a decimal string using 16 significant decimal digits. Both of these strings, converted to double, are the same 8.521812787393891142073699... even though they differ in the 16th digit. Saying this double had 16 digits of precision would overstate it.

/*
1 234567890123456789 */
8.521812787393891
8.521812787393891142073699...
8.521812787393892
6
votes

No, it is variable. The starting point is the very weak IEEE-754 standard, which only nailed down the format of floating-point numbers as they are stored in memory. You can count on 7 digits of precision for single precision and 15 digits for double precision.

But a major flaw in that standard is that it does not specify how calculations are to be performed. And there's trouble: the Intel 8087 floating point processor in particular has caused programmers many sleepless nights. A significant design flaw in that chip is that it stores floating point values with more bits than the memory format: 80 bits instead of 32 or 64. The theory behind that design choice is that it allows intermediate calculations to be more precise and cause less round-off error.

Sounds like a good idea, but it did not turn out well in practice. A compiler writer will try to generate code that leaves intermediate values stored in the FPU as long as possible. That matters for code speed, because storing the value back to memory is expensive. The trouble is that values often must be stored back: the number of registers in the FPU is limited and the code might cross a function boundary. At that point the value gets truncated back and loses a lot of precision. Small changes to the source code can now produce drastically different values. Also, the non-optimized build of a program can produce different results from the optimized one, in a way that is very hard to diagnose: you'd have to look at the machine code to know why the result is different.

Intel redesigned their processor to solve this problem: the SSE instruction set calculates with the same number of bits as the memory format. It was slow to catch on, however, since redesigning the code generator and optimizer of a compiler is a significant investment. The big three C++ compilers have all switched. But, for example, the x86 jitter in the .NET Framework still generates FPU code, and it always will.


Then there is systemic error: losing precision as an inevitable side-effect of conversion and calculation. Conversion first: humans work with numbers in base 10 but the processor uses base 2. Nice round numbers we use, like 0.1, cannot be converted to nice round numbers on the processor. 0.1 is perfect as a sum of powers of 10, but there is no finite sum of powers of 2 that produces the same value. Converting it produces an infinite number of 1s and 0s, in the same manner that you can't perfectly write down 10 / 3. So it needs to be truncated to fit the processor, and that produces a value that's off by +/- 0.5 bit from the decimal value.

And calculation produces error. A multiplication or division doubles the number of bits in the result, and rounding it to fit back into the stored value produces +/- 0.5 bit of error. Subtraction is the most dangerous operation and can cause the loss of a lot of significant digits. If you, say, calculate 1.234567f - 1.234566f then the result has only 1 significant digit left. That's a junk result. Summing the difference between numbers that have nearly the same value is very common in numerical algorithms.
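The loss of significant digits in that subtraction can be measured. This is a sketch only: `subtraction_relative_error` is a helper invented for the example, and the reference value is supplied by the caller.

```cpp
#include <cassert>
#include <cmath>

// Relative error of the float subtraction a - b against a reference value
// `exact` supplied by the caller. Hypothetical helper for illustration.
double subtraction_relative_error(float a, float b, double exact) {
    assert(exact != 0.0);
    const float difference = a - b;  // rounded to float precision
    return std::fabs(static_cast<double>(difference) - exact) / std::fabs(exact);
}
```

Calling `subtraction_relative_error(1.234567f, 1.234566f, 1e-6)` on a typical IEEE-754 platform reports an error on the order of a few percent, several orders of magnitude worse than the roughly 1e-7 relative accuracy of each input.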

Getting excessive systemic errors is ultimately a flaw in the mathematical model. Just as an example, you never want to use Gaussian elimination; it is very unfriendly to precision. Always consider an alternative approach; LU Decomposition is an excellent one. It is, however, not that common that a mathematician was involved in building the model and accounted for the expected precision of the result. A common book like Numerical Recipes also doesn't pay enough attention to precision, although it indirectly steers you away from bad models by proposing the better one. In the end, a programmer often gets stuck with the problem. Well, if it were easy then anybody could do it and I'd be out of a good paying job :)

5
votes

The type of a floating point variable defines what range of values and how many fractional bits (!) can be represented. As there is no integer relation between decimal and binary fraction, the decimal fraction is actually an approximation.

Second, another problem is the precision with which arithmetic operations are performed. Just think of 1.0/3.0 or PI. Such values cannot be represented with a limited number of digits, neither decimal nor binary. So the values have to be rounded to fit into the given space. The more fractional digits are available, the higher the precision.

Now think of multiple such operations being applied, e.g. PI/3.0. This requires rounding twice: PI as such is not exact, and neither is the result. This loses precision twice, and if repeated it becomes worse.

So, back to float and double: according to the standard (C11, Annex F, also for the rest), float has fewer bits available, so rounding will be less precise than for double. Just think of having a decimal with 2 fractional digits (m.ff, call it float) and one with four (m.ffff, call it double). If double is used for all calculations, you can perform more operations before your result has only 2 correct fractional digits than if you start with float, even if a float result would suffice.
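The effect of repeated rounding in a narrower type can be made visible with a long running sum. This is a minimal sketch; `accumulated_error` is a name made up for the example, and the comparison target `n * step_exact` is itself computed in double.

```cpp
#include <cassert>
#include <cmath>

// Adds `step_exact` (converted to T) to an accumulator of type T, n times,
// and returns the absolute difference from n * step_exact.
// Hypothetical helper for illustration.
template <typename T>
double accumulated_error(double step_exact, int n) {
    assert(n >= 0);
    const T step = static_cast<T>(step_exact);
    T sum = 0;
    for (int i = 0; i < n; ++i) sum += step;  // one rounding per addition
    return std::fabs(static_cast<double>(sum) - n * step_exact);
}
```

Comparing `accumulated_error<float>(0.1, 1000000)` with `accumulated_error<double>(0.1, 1000000)` shows the float accumulator drifting much further from the expected total than the double one.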

Note that on some (embedded) CPUs like ARM Cortex-M4F, the hardware FPU only supports float (single precision), so double arithmetic will be much more costly. Other MCUs have no hardware floating point unit at all, so floating point has to be emulated in software (very costly). On most GPUs, float is also much cheaper than double, sometimes by more than a factor of 10.

5
votes

The storage has a precise digit count in binary, as other answers explain.

One thing to know: the CPU can run operations at a different precision internally, such as 80 bits. This means that code like the following can throw:

void Kaboom( float a, float b, float c ) // same is true for other floating point types.
{
    float sum1 = a+b+c;
    float sum2 = a+b;
    sum2 += c; // assume the compiler did not keep sum2 in a register: the value was written to memory and then loaded again.
    if (sum1 !=sum2)
        throw "kaboom"; // this can happen.
}

It is more likely with more complex computation.
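Whether intermediate results may carry extra precision is something an implementation reports through the standard `FLT_EVAL_METHOD` macro in `<cfloat>`. The sketch below only wraps that macro; the helper names `describe_eval_method` and `current_eval_method` are made up for the example, and the descriptions paraphrase the C standard's wording.

```cpp
#include <cassert>
#include <cfloat>

// Meaning of a FLT_EVAL_METHOD value, paraphrased from the C standard.
// Hypothetical helper for illustration.
const char* describe_eval_method(int method) {
    switch (method) {
        case 0:  return "operations evaluated in the range and precision of their type";
        case 1:  return "float and double evaluated as double, long double as long double";
        case 2:  return "all operations evaluated as long double";
        case -1: return "indeterminable";
        default: return "implementation-defined";
    }
}

// The value reported by the current compiler and target.
int current_eval_method() { return FLT_EVAL_METHOD; }
```

A value of 2 is the situation where a comparison like the one in Kaboom can fail; a value of 0 is what SSE-based code generation on x86-64 typically reports.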

4
votes

I'm going to add the off-beat answer here, and say that since you've tagged this question as C++, there is no guarantee whatsoever about precision of floating point data. The vast majority of implementations use IEEE-754 when implementing their floating point types, but that is not required. The only thing required by the C++ language is that (C++ spec §3.9.1.8):

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
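
Since the representation is implementation-defined, the portable way to find out what a given compiler provides is to ask `std::numeric_limits`. This is a small sketch; the `FloatInfo` struct and `float_info` function are names invented for the example.

```cpp
#include <cassert>
#include <limits>

// Precision-related properties the implementation reports for a type.
// Hypothetical struct for illustration.
struct FloatInfo {
    int radix;         // base of the representation, usually 2
    int digits;        // significand digits in that base (53 for binary64)
    int digits10;      // decimal digits that survive text -> T -> text
    int max_digits10;  // decimal digits needed for T -> text -> T
    bool is_iec559;    // true if the type follows IEC 559 / IEEE-754
};

template <typename T>
FloatInfo float_info() {
    static_assert(std::numeric_limits<T>::is_specialized, "arithmetic type expected");
    using L = std::numeric_limits<T>;
    return {L::radix, L::digits, L::digits10, L::max_digits10, L::is_iec559};
}
```

Comparing `float_info<float>()`, `float_info<double>()`, and `float_info<long double>()` shows the "at least as much precision" ordering quoted above on whatever platform the code is compiled for.
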
3
votes

The amount of space required to store a float will be constant, and likewise a double; the amount of useful precision will in relative terms generally vary, however, between one part in 2^23 and one part in 2^24 for float, or one part in 2^52 and one part in 2^53 for double. Precision very near zero isn't that good, with the second-smallest positive value being twice as big as the smallest, which will in turn be infinitely greater than zero. Throughout most of the range, however, precision will vary as described above.

Note that while it often isn't practical to have types whose relative precision varies by less than a factor of two throughout their range, the variation in precision can sometimes cause calculations to yield much less accurate results than it would appear they should. Consider, for example, 16777215.0f + 4.0f - 4.0f. All of the values would be precisely representable as float using the same scale, and the nearest values to the large one are +/- one part in 16,777,215, but the first addition yields a result in the part of the float range where values are separated by one part in only 8,388,610, causing the result to be rounded to 16,777,220. Consequently, subtracting 4 yields 16,777,216 rather than 16,777,215. For most values of float near 16777216, adding 4.0f and subtracting 4.0f would yield the original value unchanged, but the changing precision right at the break-over point causes the result to be off by an extra bit in the lowest place.
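The break-over example can be reproduced as follows. This sketch assumes float is IEEE-754 binary32 with the default round-to-nearest-even mode; the `volatile` store is only there to force the intermediate sum to be rounded to float even on hardware that evaluates in wider registers, and `add_then_subtract` is a name made up for the example.

```cpp
#include <cassert>

// Computes (x + delta) - delta, forcing the intermediate sum to be rounded
// to float by storing it in a volatile variable. Hypothetical helper.
float add_then_subtract(float x, float delta) {
    volatile float sum = x + delta;
    return sum - delta;
}
```

Under those assumptions `add_then_subtract(16777215.0f, 4.0f)` returns 16777216.0f, while a slightly smaller input such as 16777211.0f comes back unchanged.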

0
votes

Well, the answer to this is simple but complicated. These numbers are stored in binary. Depending on whether it is a float or a double, the computer uses a different number of bits to store the number. The precision that you get depends on those bits. If you don't know how binary numbers work, it would be a good idea to look it up. But simply put, some numbers need more ones and zeros than other numbers.

So the precision is fixed (same number of binary digits), but the actual precision that you get depends on the numbers that you are using.