binary floating point addition algorithm

Question

I'm trying to understand IEEE 754 floating point addition at a binary level. I have followed some example algorithms that I have found online, and a good number of test cases match against a proven software implementation. My algorithm is only dealing with positive numbers at the moment. However, I am not getting a match with this test case:

00001000111100110110010010011100 (1.46487e-33)
00000000000011000111111010000100 (1.14741e-39)

I split each value into sign bit, exponent, and mantissa, and add the implicit 1 back into the mantissa:

0 00010001 1.11100110110010010011100
0 00000000 1.00011000111111010000100
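For reference, the raw field split on its own (before any implicit bit is added) could be sketched in Python as below. The helper name `split_fields` is hypothetical and the sketch assumes the input is a 32-bit pattern written as a string of 0s and 1s.

```python
def split_fields(bits: str):
    """Split a 32-bit pattern (string of '0'/'1') into its three raw fields."""
    assert len(bits) == 32
    sign = int(bits[0], 2)        # 1 sign bit
    exponent = int(bits[1:9], 2)  # 8-bit biased exponent
    fraction = int(bits[9:], 2)   # 23 stored fraction bits, no implicit bit added
    return sign, exponent, fraction

for pattern in ("00001000111100110110010010011100",
                "00000000000011000111111010000100"):
    s, e, f = split_fields(pattern)
    print(s, format(e, "08b"), format(f, "023b"))
```

This prints only the stored bits; whether a leading 1 belongs in front of the fraction depends on the exponent field.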

I subtract the smaller exponent from the larger in order to determine the realignment-shift amount:

 00010001 (17)
-00000000 (0)
 =============
           17

I tack on a guard bit, round bit, and sticky bit to the mantissas:

1.11100110110010010011100 000
1.00011000111111010000100 000

I shift the lesser value's mantissa to the right 17 times, with the LSb "sticking" once it receives a 1:

0.00000000000000001000110 001
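A sticky right shift like this one might be sketched as follows. The names `shift_right_sticky` and `show` are made up for illustration, and the value is assumed to be the 24-bit mantissa with the three extra bits already appended (27 bits in total).

```python
def shift_right_sticky(value: int, amount: int) -> int:
    """Shift right by `amount`, OR-ing every bit that falls off into the LSB."""
    if amount <= 0:
        return value
    lost = value & ((1 << amount) - 1)  # bits shifted out past the sticky position
    return (value >> amount) | (1 if lost else 0)

def show(ext: int) -> str:
    """Render a 27-bit value as i.fff...f grs, matching the layout used above."""
    s = format(ext, "027b")
    return f"{s[0]}.{s[1:24]} {s[24:]}"

mantissa = 0b100011000111111010000100    # 1.00011000111111010000100 as written above
print(show(shift_right_sticky(mantissa << 3, 17)))
# prints 0.00000000000000001000110 001
```

Doing the shift in one step and OR-ing the lost bits gives the same result as shifting one place at a time with a sticking LSb.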

I add the greater mantissa to the shifted lesser mantissa:

1.11100110110010010011100 000 +
0.00000000000000001000110 001
================================
1.11100110110010011100010 001

Since there was no overflow, and the guard bit is 0, I can use the summation-mantissa and greater-exponent directly (re-removing the implicit '1'):

0 00010001 11100110110010011100010

Giving a final value of:

00001000111100110110010011100010 (1.46487e-33)

But according to my verification implementation, I should be getting:

00001000111100110110010010101000 (1.46487e-33)
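One way to reproduce a reference value like this, assuming Python is at hand, is to let the `struct` module convert the bit patterns to floats and back. This is only a sketch and not necessarily the verification implementation mentioned above; the helper names are hypothetical. For this pair of inputs the intermediate double-precision sum is exact, so the only rounding happens when packing back to 32 bits.

```python
import struct

def bits_to_float(bits: str) -> float:
    """Interpret a 32-bit pattern (string of '0'/'1') as an IEEE 754 binary32 value."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

def float_to_bits(x: float) -> str:
    """Round x to binary32 and return its 32-bit pattern as a string."""
    return format(int.from_bytes(struct.pack(">f", x), "big"), "032b")

a = bits_to_float("00001000111100110110010010011100")
b = bits_to_float("00000000000011000111111010000100")
print(float_to_bits(a + b))
# prints 00001000111100110110010010101000
```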

So very close but not exact. Is there a mistake in my algorithm?

Zero exponent means subnormal number. There is no implicit one bit. — Patricia Shanahan
The subnormal error accounts for a one bit difference in the final result, 00001000111100110110010010100010. It does not explain the different location of the least significant one bit in the two answers. — Patricia Shanahan

Patricia Shanahan · Accepted Answer · 2018-08-05T01:02:12

There appear to be two problems in the calculation, both related to treating a subnormal number as though it were normal:

Incorrect shift calculation. The subnormal's exponent is -126, not -127, so the shift should be 16 bits, not 17.
Incorrectly inserting a one bit before the binary point.
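Both points can be handled in the decoding step. The following is a minimal Python sketch with a hypothetical helper `decode`, assuming finite inputs only (infinities and NaNs are not treated specially):

```python
def decode(bits: str):
    """Return (significand, effective biased exponent) for a finite binary32 pattern."""
    exponent = int(bits[1:9], 2)
    fraction = int(bits[9:], 2)
    if exponent == 0:
        # Subnormal or zero: no implicit 1, and the exponent behaves like
        # a biased exponent of 1 (that is, 2**-126), not 0.
        return fraction, 1
    return (1 << 23) | fraction, exponent  # normal: prepend the implicit 1

sig_a, exp_a = decode("00001000111100110110010010011100")
sig_b, exp_b = decode("00000000000011000111111010000100")
print(format(sig_b, "024b"), exp_a - exp_b)
# prints 000011000111111010000100 16
```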

Here is the revised calculation:

0 00010001 1.11100110110010010011100
0 00000000 0.00011000111111010000100

Tack on a Guard bit, Round Bit, and Sticky Bit to the mantissas:

1.11100110110010010011100 000
0.00011000111111010000100 000

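Shift the smaller number right by 16 bits (17 - 1, since a subnormal has the same effective exponent as a biased exponent of 1):

0.00000000000000000001100 011

Add the greater mantissa to the shifted lesser mantissa:

1.11100110110010010011100 000 +
0.00000000000000000001100 011
================================
1.11100110110010010101000 011

The guard bit is 0, so the sum rounds down, giving 0 00010001 11100110110010010101000, which matches the expected result.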
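Putting the corrected steps together, an adder for two positive values might look like the Python sketch below. It is an illustration under stated assumptions, not a complete IEEE 754 implementation: the function name `add_positive` is hypothetical, inputs are assumed positive and finite, and rounding is round-to-nearest-even.

```python
def add_positive(a_bits: int, b_bits: int) -> int:
    """Add two positive, finite binary32 values given as 32-bit integers."""
    def decode(w):
        e, f = (w >> 23) & 0xFF, w & 0x7FFFFF
        # Subnormal: no implicit 1, effective exponent 1. Normal: implicit 1.
        return (f, 1) if e == 0 else (f | 0x800000, e)

    (sa, ea), (sb, eb) = decode(a_bits), decode(b_bits)
    if ea < eb:                          # make (sa, ea) the larger-exponent operand
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)

    sa <<= 3                             # append guard, round, sticky bits
    sb <<= 3
    shift = ea - eb
    lost = sb & ((1 << shift) - 1)       # bits shifted out during alignment
    sb = (sb >> shift) | (1 if lost else 0)

    total, exp = sa + sb, ea
    if total >> 27:                      # carry out: renormalise, keeping sticky
        total = (total >> 1) | (total & 1)
        exp += 1

    grs, sig = total & 0b111, total >> 3
    if grs > 0b100 or (grs == 0b100 and sig & 1):  # round to nearest, ties to even
        sig += 1
        if sig >> 24:                    # rounding carried out of the significand
            sig >>= 1
            exp += 1

    if exp >= 255:
        return 0x7F800000                # overflow to +infinity
    if sig < 0x800000:
        return sig                       # result is still subnormal (exponent field 0)
    return (exp << 23) | (sig & 0x7FFFFF)

result = add_positive(0b00001000111100110110010010011100,
                      0b00000000000011000111111010000100)
print(format(result, "032b"))
# prints 00001000111100110110010010101000
```

Treating a subnormal as having an effective exponent of 1 is what makes both the shift amount and the missing implicit bit come out right in the same code path.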
