
For the sake of simplicity I will be using, and asking answers to use, 8-bit floats. Also, ignore the sign bit.

In our Numerical Methods class, we're being taught one floating-point representation in the theory classes and another in the lab classes. We have different teachers for each, and they do not coordinate on the topics covered in successive classes.

In the theory class we were told that floats are represented like this:

x = ±0.d_1 d_2 d_3 d_4 * 2^e

where d_1 is always 1. No further conditions or constraints were given. Let's call this A.
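
To make A concrete, here's a small Python sketch of how I understand it. The 3-bit exponent field with a bias of 3 and the 4-bit mantissa are my assumptions, since we were only told the general shape:

```python
def decode_A(exp_field: int, man_field: int) -> float:
    """Decode an A-style value 0.d1 d2 d3 d4 * 2^e.

    Assumed (not stated in class): 3-bit exponent field with
    bias 3, so e = exp_field - 3; 4-bit mantissa with d1 = 1.
    """
    assert 0 <= exp_field < 8 and 0 <= man_field < 16
    fraction = man_field / 16            # 0.d1d2d3d4 as a value in [0, 1)
    return fraction * 2 ** (exp_field - 3)

# Smallest non-zero value: mantissa 1000 (0.1 in binary), minimal exponent
print(decode_A(0b000, 0b1000))           # 0.0625 == 2**-4
```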

In the lab class, we were taught the IEEE-754 format:

x = ±1.m_1 m_2 m_3 m_4 * 2^(e - bias)

where, if e is 000, the number is denormal (the implicit leading bit becomes 0 and the effective exponent is 1 - bias); if e is 111 and the mantissa is 0000, it's infinity; and if e is 111 and the mantissa is non-zero, it's not a number (NaN). Let's call this B.
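
And a similar sketch for B, again assuming the 3-bit exponent / 4-bit mantissa split with a bias of 3:

```python
import math

def decode_B(exp_field: int, man_field: int) -> float:
    """Decode an 8-bit IEEE-754-style value (sign ignored):
    3-bit biased exponent (bias 3), 4-bit mantissa."""
    assert 0 <= exp_field < 8 and 0 <= man_field < 16
    if exp_field == 0b111:                   # special exponent value
        return math.inf if man_field == 0 else math.nan
    if exp_field == 0b000:                   # denormal: 0.m * 2^(1-3)
        return (man_field / 16) * 2 ** (1 - 3)
    return (1 + man_field / 16) * 2 ** (exp_field - 3)  # normal: 1.m * 2^(e-3)

print(decode_B(0b001, 0b0000))   # smallest normal:   0.25     == 2**-2
print(decode_B(0b000, 0b0001))   # smallest denormal: 0.015625 == 2**-6
print(decode_B(0b111, 0b0000))   # inf
print(decode_B(0b111, 0b0101))   # nan
```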

Here's what I understood when it comes to finding the smallest non-zero number.

In A, e takes its minimum value, e_min = 0 - 3 = -3 (assuming a bias of 3). The overall number is then 0.1 * 2^-3, which is 2^-4.

But in B, the smallest non-zero normal is 1 * 2^(1-3), which is 2^-2; and the smallest non-zero denormal is 0.0001 * 2^(1-3) = 2^-4 * 2^-2 = 2^-6.
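
In plain Python, the three values come out as:

```python
a_min        = (1 / 2)  * 2 ** -3       # A: 0.1000_2 * 2^-3    = 2^-4
b_min_normal = 1.0      * 2 ** (1 - 3)  # B: 1.0000_2 * 2^(1-3) = 2^-2
b_min_denorm = (1 / 16) * 2 ** (1 - 3)  # B: 0.0001_2 * 2^(1-3) = 2^-6

print(a_min, b_min_normal, b_min_denorm)  # 0.0625 0.25 0.015625
```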

They don't match, even though both are supposed to be correct representations. Every other source I can find either follows only the IEEE-754 format, or simply states that a number can be represented in different ways by changing the position of the radix point and the exponent, such as this video from 21:50 onward. But none tell me how the two formats are related.

Where am I going wrong? How can I get the same values? How are they related?

Version A looks like the VAX floating-point format (single precision: F_floating; double precision: G_floating), with the mantissa normalized to 0.1mmmm..., so in [0.5, 1), and the leading '1' implicit. The exponent bias was 128 and 1024, respectively (differing from IEEE-754 by one). As I recall, there were no denormals in the VAX floating-point formats. – njuffa

1 Answer


I agree that they do not match.

"A" is the way most binary floating-point numbers worked before the advent if IEEE-754.

A lot of edge cases were not handled well with that, so along came 754 ("B") in the early '80s. Among the changes:

  • Previously, "normalization" was optional; that is, d_1 did not have to be 1. "A" is a bit strange, because I don't think any hardware forced normalization after an operation. (Normalization was a costly operation back then.)
  • Denormalized numbers and "gradual underflow". These add a bunch of complications to the understanding and the hardware implementation, but mathematicians like them. Such numbers are disallowed in "A".
  • Picking the bias for "e" so that almost all inverses (1/x) don't overflow or underflow.
  • The leading 1 (of "B") was effectively a "free" bit, since it is not actually stored in the representation, thereby gaining a little more precision than all predecessors had, without loss of exponent range (see the sketch after this list). (Note: this can only be done in base 2, not base 16 (IBM-360), base 10, etc.) It is unclear whether "A" hides d_1.
  • Infinity and NaN stole the maximum biased exponent value (a minor loss). Perhaps only CDC had such concepts before that.
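
To see the "free" hidden bit concretely, here's a quick Python count of distinct significands per binade with 4 stored mantissa bits, assuming "A" stores its leading 1 explicitly (which, as noted, is unclear):

```python
# Distinct significands per binade with 4 stored mantissa bits:
# if "A" stores its leading 1 explicitly as 0.1 d2 d3 d4, only
# d2..d4 can vary; "B"'s hidden 1 lets all four stored bits vary.
a_significands = {(8 + d) / 16 for d in range(8)}    # 0.5, 0.5625, ..., 0.9375
b_significands = {(16 + m) / 16 for m in range(16)}  # 1.0, 1.0625, ..., 1.9375

print(len(a_significands), len(b_significands))      # 8 16 -> one extra bit
```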

The smallest number:

A: 0.100...00 with A's minimal exponent
B: 0.000...01 with B's minimal exponent
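
A quick enumeration (using the question's assumed 3-bit exponent with bias 3 and 4-bit mantissa) shows how B's denormals fill the gap below its smallest normal in steps of 2^-6, while A jumps straight from 0 to 2^-4:

```python
# B's denormals tile the gap (0, 2^-2) in steps of 2^-6 ...
b_denormals = [(m / 16) * 2 ** (1 - 3) for m in range(1, 16)]
print(b_denormals[0], b_denormals[-1])   # 0.015625 ... 0.234375

# ... while A has nothing between 0 and its smallest value, 2^-4.
a_min = (1 / 2) * 2 ** -3
print(a_min)                             # 0.0625
```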

Virtually every commercial floating-point implementation now follows IEEE-754; "A" is there for ancient history.