How to convert float to double (both stored in IEEE-754 representation) without losing precision?

7

votes

I mean, for example, I have the following number encoded in IEEE-754 single precision:

"0100 0001 1011 1110 1100 1100 1100 1100"  (approximately 23.85 in decimal)

The binary number above is stored as a literal string.

The question is, how can I convert this string into IEEE-754 double precision representation (somewhat like the following one, but the value is not the same), WITHOUT losing precision?

"0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010"

which is ~~the same number~~ encoded in IEEE-754 double precision.

I have tried using the following algorithm to convert the first string back to decimal number first, but it loses precision.

num in decimal = (sign) * (1 + frac * 2^(-23)) * 2^(exp - 127)

I'm using Qt C++ Framework on Windows platform.

EDIT: I must apologize; maybe I didn't express the question clearly. What I mean is that I don't know the true value 23.85. I only have the first string, and I want to convert it to double precision representation without precision loss.

c++ qt floating-point double ieee-754

@tenfour I think it is because he is storing it as string - Caesar

Your second binary string is not the same number as the first, but in double precision. What is the problem you try to solve? - Daniel Fischer

Daniel Fischer is correct; those are in no way the same number (the first is 23.84999847412109375, the second is 23.85000000000000142108547152020037174224853515625). - Stephen Canon

3

votes

Well: keep the sign bit, rewrite the exponent (minus old bias, plus new bias), and pad the mantissa with zeros on the right...

(As @Mark says, you have to treat some special cases separately, namely when the biased exponent is either zero or max.)
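A minimal sketch of that recipe on the raw bit patterns, assuming the 32 bits have already been parsed into an integer. The name widenBits is made up for illustration. The two special cases are handled explicitly: a maximal exponent (infinity/NaN) stays maximal, and a zero exponent is either a signed zero or a subnormal float, which has to be normalised because it becomes a normal double.

```cpp
#include <cstdint>

// Hypothetical helper: widen IEEE-754 binary32 bits to binary64 bits
// using integer operations only (no floating-point conversion involved).
std::uint64_t widenBits(std::uint32_t f)
{
    const std::uint64_t sign = static_cast<std::uint64_t>(f >> 31) << 63;
    const std::uint32_t exp  = (f >> 23) & 0xFFu;  // exponent, biased by 127
    std::uint32_t frac       = f & 0x7FFFFFu;      // 23 fraction bits

    if (exp == 0xFFu) {
        // Infinity or NaN: maximal exponent, fraction padded on the right.
        return sign | (0x7FFull << 52) | (static_cast<std::uint64_t>(frac) << 29);
    }
    if (exp == 0) {
        if (frac == 0) {
            return sign;  // signed zero
        }
        // Subnormal float: shift until the implicit leading 1 appears,
        // lowering the exponent by one for every shift.
        int e = -126;
        while ((frac & 0x800000u) == 0) {
            frac <<= 1;
            --e;
        }
        frac &= 0x7FFFFFu;  // drop the now-implicit leading 1
        return sign | (static_cast<std::uint64_t>(e + 1023) << 52)
                    | (static_cast<std::uint64_t>(frac) << 29);
    }
    // Normal number: minus old bias (127), plus new bias (1023),
    // and 52 - 23 = 29 zero bits appended to the fraction.
    const std::uint64_t e64 = static_cast<std::uint64_t>(exp) - 127 + 1023;
    return sign | (e64 << 52) | (static_cast<std::uint64_t>(frac) << 29);
}
```

For the string in the question (0x41BECCCC) this gives 0x4037D99980000000, i.e. the same fraction bits followed by zeros, not the ...999A pattern of the double closest to 23.85.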

2

votes

IEEE-754 (and floating point in general) cannot represent periodic (repeating) binary fractions with full precision. Not even when they are, in fact, rational numbers with a relatively small integer numerator and denominator. Some languages provide a rational type that can do it (they are the languages that also support unbounded precision integers).

As a consequence those two numbers you posted are NOT the same number.

They in fact are:

10111.11011001100110011000000000000000000000000000000000000000 ...
10111.11011001100110011001100110011001100110011001101000000000 ...

where ... represents an infinite sequence of 0s.

Stephen Canon in a comment above gives you the corresponding decimal values (did not check them, but I have no reason to doubt he got them right).

Therefore the conversion you want to do cannot be done, as the single precision number does not have the information you would need (you have NO WAY to know whether the number is in fact periodic or merely looks periodic because a pattern happens to repeat).

2

votes

First of all, +1 for identifying the input in binary.

Second, that number does not represent 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number will still not accurately represent 23.85, but slightly more. Those differences cannot be adequately captured in a float, but they can be approximately captured in a double.

Third, what you think you are losing is called accuracy, not precision. The precision of the number always grows by conversion from single precision to double precision, while the accuracy can never improve by a conversion (your inaccurate number remains inaccurate, but the additional precision makes it more obvious).

I recommend converting to a float or rounding or adding a very small value just before displaying (or logging) the number, because visual appearance is what you really lost by increasing the precision.

Resist the temptation to round right after the cast and to use the rounded value in subsequent computation - this is especially risky in loops. While this might appear to correct the issue in the debugger, the accumulated additional inaccuracies could distort the end result even more.
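To illustrate the "round only for display" advice, here is a small sketch. formatForDisplay is a made-up helper name, and it assumes a printf implementation that rounds correctly. The widened double is left untouched for computation; only the text shown to the user is limited to a float's worth of digits.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format a value with a given number of significant
// digits for display or logging, without modifying the value itself.
std::string formatForDisplay(double value, int significantDigits)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", significantDigits, value);
    return buf;
}
```

With float f = 23.85f and double d = f, the double holds exactly 23.8500003814697265625 (the float's value, unchanged by the conversion), while formatForDisplay(d, 7) gives "23.85". Asking for more digits just makes the float's original inaccuracy visible.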

1

votes

It might be easiest to convert the string into an actual float, convert that to a double, and convert it back to a string.
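Something along these lines, as a rough sketch. It assumes the platform's float and double are IEEE-754 binary32 and binary64 (true on Windows/x86), and widenBitString is a made-up name. Spaces in the input are skipped, and the input is assumed to contain exactly 32 binary digits.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "sketch assumes 32-bit float and 64-bit double");

// Hypothetical helper: binary32 bit string in, binary64 bit string out.
std::string widenBitString(const std::string& text)
{
    // Keep only the binary digits (the question's string contains spaces).
    std::string bits;
    for (char c : text) {
        if (c == '0' || c == '1') {
            bits += c;
        }
    }

    // String -> 32-bit integer -> float with the same bit pattern.
    const std::uint32_t raw =
        static_cast<std::uint32_t>(std::bitset<32>(bits).to_ulong());
    float f;
    std::memcpy(&f, &raw, sizeof f);

    // float -> double is exact: every float value is representable as a double.
    const double d = f;

    // double -> 64-bit integer -> string.
    std::uint64_t wide;
    std::memcpy(&wide, &d, sizeof wide);
    return std::bitset<64>(wide).to_string();
}
```

One caveat: the hardware conversion may turn a signaling NaN into a quiet NaN, so NaN payloads are not guaranteed to survive bit for bit.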

-1

votes

Binary floating points cannot, in general, represent decimal fraction values exactly. The conversion from a decimal fractional value to a binary floating point (see "Bellerophon" in "How to Read Floating-Point Numbers Accurately" by William D. Clinger) and from a binary floating point back to a decimal value (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L. Steele Jr. and Jon L. White) yield the expected results because one converts a decimal number to the closest representable binary floating point and the other controls the error to know which decimal value it came from (both algorithms are improved on and made more practical in David Gay's dtoa.c). The algorithms are the basis for restoring std::numeric_limits<T>::digits10 decimal digits (except, potentially, trailing zeros) from a floating point value stored in type T.

Unfortunately, expanding a float to a double wreaks havoc on the value: Trying to format the new number will in many cases not yield the decimal original because the float padded with zeros is different from the closest double Bellerophon would create and, thus, Dragon4 expects. There are basically two approaches which work reasonably well, however:

1. As someone suggested, convert the float to a string and this string into a double. This isn't particularly efficient but can be proven to produce the correct results (assuming a correct implementation of the not entirely trivial algorithms, of course).
2. Assuming your value is in a reasonable range, you can multiply it by a power of 10 such that the least significant decimal digit is non-zero, convert this number to an integer, this integer to a double, and finally divide the resulting double by the original power of 10. I don't have a proof that this yields the correct number but for the range of values I'm interested in and which I want to store accurately in a float, this works.
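A rough sketch of both approaches, with made-up function names. Both assume the float originally came from a decimal value with at most std::numeric_limits<float>::digits10 significant digits. Note that they deliberately change the value: the result is the double closest to that decimal, not the exact value the float holds. The second function additionally assumes the caller knows how many decimal places the original had and that the scaled value fits in a long long.

```cpp
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <limits>

// Approach 1 (hypothetical helper): float -> short decimal string -> double.
double widenViaDecimalString(float f)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g",
                  std::numeric_limits<float>::digits10,
                  static_cast<double>(f));
    return std::strtod(buf, nullptr);  // assumes the default "C" locale
}

// Approach 2 (hypothetical helper): scale by a power of 10, round to an
// integer, convert that integer to double, and divide by the same power of 10.
double widenViaScaling(float f, int decimalPlaces)
{
    double scale = 1.0;
    for (int i = 0; i < decimalPlaces; ++i) {
        scale *= 10.0;
    }
    const long long scaled = std::llround(static_cast<double>(f) * scale);
    return static_cast<double>(scaled) / scale;
}
```

For example, both widenViaDecimalString(23.85f) and widenViaScaling(23.85f, 2) return the double nearest to 23.85, whereas a plain static_cast<double>(23.85f) keeps the float's value of about 23.8500004.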

One reasonable approach to avoid this issue entirely is to use decimal floating point values, as described for C++ in the Decimal TR, in the first place. Unfortunately, these are not yet part of the standard, but I have submitted a proposal to the C++ standardization committee to get this changed.
