
I want to ask a question about IEEE 754 floating-point operations:

(Take an IEEE 754 single-precision floating-point number as an example: 1 sign bit, 8 exponent bits, 23 mantissa bits.)

When adding or subtracting two floating-point numbers, the mantissa of the number with the smaller exponent must be aligned to that of the number with the larger exponent.

That is, the exponent difference between the two floating-point numbers determines how far the mantissa shifts.

Here is my problem: after the shift, the mantissa of the number with the smaller exponent may extend beyond the range that a mantissa can express.

Should we include the bits that exceed the range in the calculation, or do we have to discard them?

For example, I want to calculate the subtraction of two floating-point numbers.

The first operand: 0 (sign) 10010011 (exponent) 0000 0000 0000 0000 1111 111 (mantissa)

The second operand: 1 (sign) 10001110 (exponent) 0000 0000 0000 0111 1111 111 (mantissa)

The exponent of the first number is 147 in decimal and the exponent of the second number is 142, so 147 - 127 (bias) = 20 and 142 - 127 = 15.

So in fact the above two numbers can become:

The first operand: 1.0000 0000 0000 0000 1111 111 * 2^20

The second operand: -1.0000 0000 0000 0111 1111 111 * 2^15

Because the second number's exponent is five less than the first's, its mantissa needs to be shifted 5 bits to the right. Then my question is, what should it become:

  1. All bits are retained, so a total of 28 bits is needed to represent the mantissa: -0.0000 1000 0000 0000 0011 111 "1 1111" (these five bits exceed 23 bits) * 2^20

  2. The bits beyond 23 are cut off directly, so the mantissa stays within 23 bits: -0.0000 1000 0000 0000 0011 111 * 2^20

  3. Guard, round, and sticky bits are added for rounding, so 25 bits represent the mantissa: -0.0000 1000 0000 0000 0011 111 11, where the last two bits (bits 24 and 25) are the guard and round bits, and S = 1 (because the last three 1s were cut off).

Which one of the above options is right, or is none of them right?
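One way to check which option matches real hardware is to build the two operands from their bit patterns and compare the machine result against the exact real-number difference. This is a sketch, assuming a correctly rounded IEEE 754 implementation and Python's `struct` module; the helper names are mine:

```python
import struct
from fractions import Fraction

def f32_from_bits(w):
    """Reinterpret a 32-bit integer as an IEEE 754 single."""
    return struct.unpack('>f', struct.pack('>I', w))[0]

def round_to_f32(x):
    """Round a Python float (a double) to single precision."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

# the two operands from the question
a = f32_from_bits(0b0_10010011_00000000000000001111111)  #  1.00000000000000001111111 * 2^20
b = f32_from_bits(0b1_10001110_00000000000001111111111)  # -1.00000000000001111111111 * 2^15

# a + b needs at most 28 significant bits, so computing it in
# double precision gives the exact real-number result here
exact = Fraction(a) + Fraction(b)
print(float(exact))         # 1015819.87890625 (exact real-number result)
print(round_to_f32(a + b))  # 1015819.875 (rounded to single precision)
```

The machine result equals the exact result rounded to single precision, which is what the guard/round/sticky scheme in option 3 is designed to reproduce.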

while specific implementations are an interpretation of the spec, you can still try this with specific implementations, should be somewhat simple, and see what the results are (need to subtract not add). – old_timer

@old_timer Sorry, I don't quite understand what you mean; can you please make your comment clearer? Thank you. – shan w

2 Answers

2
votes

Per IEEE 754, all bits are always considered. The result produced by the operation is the same as if you computed the full result with real-number arithmetic and then rounded that to fit into the floating-point format using whichever rounding rule is in effect. (Round to nearest, ties to even low bit/digit, is common, but there are other options for rounding, such as always upward, always downward, toward zero, and always round any non-zero amount toward an odd low bit.)

This does not mean the computer always has to compute the full real number result. For addition and subtraction, using round, guard, and sticky bits suffices to get the required answer. For other operations, more complicated algorithms may be needed. The requirement is merely that the computer has to figure out what you would get if you computed the full real-number result and rounded it—it does not actually have to figure out the full real-number result.
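This can be checked directly: compute the exact difference with rationals, round it to a 24-bit significand by hand, and compare with the machine result. A sketch, assuming Python; `round_exact_to_single` is a hypothetical helper that only handles non-zero values in the normal range:

```python
import struct
from fractions import Fraction

def round_exact_to_single(x: Fraction) -> Fraction:
    """Round an exact (non-zero, normal-range) real value to the nearest
    single-precision value, ties to even, by scaling so that the 24-bit
    significand becomes an integer."""
    e = abs(x).numerator.bit_length() - abs(x).denominator.bit_length()
    if Fraction(2) ** e > abs(x):
        e -= 1                      # now 2^e <= |x| < 2^(e+1)
    ulp = Fraction(2) ** (e - 23)
    return round(x / ulp) * ulp     # Python's round() is ties-to-even

def f32(w):
    return struct.unpack('>f', struct.pack('>I', w))[0]

# 1.0 minus a value whose significand must shift right by 8:
a, b = 1.0, f32(0x3BFFFFFF)
exact = Fraction(a) - Fraction(b)
machine = struct.unpack('>f', struct.pack('>f', a - b))[0]  # double diff is exact here
print(Fraction(machine) == round_exact_to_single(exact))    # True
```

The machine never saw the "full" 31-bit intermediate as a storable significand, yet its answer is exactly the rounded exact result.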

(“Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Mantissas are logarithmic; adding to a mantissa multiplies the number represented. Significands are linear; adding to a significand adds to the number represented.)

0
votes

Where I was headed with my (deleted) comment was in the wrong direction.

Now, while each implementation is subject to possible incorrect interpretation of the spec and to bugs (and historically there have been lots of floating-point implementation bugs, not just Intel's famous one), we can try to examine one implementation (my computer).

Start with one operand 1.0

0x3F800000

0 01111111 000...
1.000... (no shift)

Then choose an operand that is going to have to have its mantissa shifted in order to perform addition and subtraction.

0x3BFFFFFF

0 01110111 11111111111111111111111

That is going to shift right by 8. So, looking at some second operands:
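The shift amount comes straight out of the exponent fields. A quick sketch to decode them (the helper name `fields` is mine):

```python
def fields(w):
    """Split a 32-bit single-precision pattern into sign,
    unbiased exponent, and 23-bit mantissa field."""
    sign = w >> 31
    exp = ((w >> 23) & 0xFF) - 127   # remove the bias of 127
    mantissa = w & 0x7FFFFF
    return sign, exp, mantissa

print(fields(0x3F800000))  # (0, 0, 0)         exponent 0
print(fields(0x3BFFFFFF))  # (0, -8, 8388607)  exponent -8 -> shift right 8
```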

0x3BFFFFFF

  1.000000000000000...00 0000...
+ 0.000000011111111...11 1111...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 1111...
==============================
  1.111111100000000...00

0x3BFFFF00  

  1.000000000000000...00 0000...
+ 0.000000011111111...11 0000...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 0000...
==============================
  1.111111100000000...10

0x3BFFFF80  

  1.000000000000000...00 0000...
+ 0.000000011111111...11 1000...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 1000...
==============================
  1.111111100000000...01

0x3BFFFFC0  

  1.000000000000000...00 0000...
+ 0.000000011111111...11 1100...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 1100...
==============================
  1.111111100000000...00

0x3BFFFF01  

  1.000000000000000...00 00000000
+ 0.000000011111111...11 00000001
=================================
  1.000000011111111...11

  1.000000000000000...00 00000000
- 0.000000011111111...11 00000000
=================================
  1.111111100000000...00 00000001
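The experiments above can be reproduced in a few lines; a sketch assuming Python's `struct`, using the same hex operands. Note that the worksheets above show pre-rounding alignment, while this program prints the final rounded bit pattern of each sum and difference:

```python
import struct

def f32(w):
    """Reinterpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack('>f', struct.pack('>I', w))[0]

def bits(x):
    """Round the double x to single precision and return its bit pattern."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

one = f32(0x3F800000)
for w in (0x3BFFFFFF, 0x3BFFFF00, 0x3BFFFF80, 0x3BFFFFC0, 0x3BFFFF01):
    print(f"{w:#010x}: 1+x -> {bits(one + f32(w)):#010x}, "
          f"1-x -> {bits(one - f32(w)):#010x}")
```

Because each operand has at most 32 significant bits here, the double-precision sum/difference is exact, and packing it back to single performs the single rounding step the hardware would.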

For addition (without rounding), the base number, the number not shifted, needs to be padded with zeros two bits past the end of the mantissa.

0+0 = 0 carry 0
0+1 = 1 carry 0

You cannot have a carry in to the first bit past the mantissa (the sticky bit), so for addition there is no reason for additional logic past that first bit; you do need that first bit for rounding, though. It only takes one bit to round.
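During alignment, everything shifted out can be collapsed into a single sticky bit. A minimal sketch (the helper `align` is mine, assuming Python):

```python
def align(sig, shift):
    """Shift a significand right by `shift`, ORing everything
    shifted out into a single sticky bit."""
    kept = sig >> shift
    sticky = 1 if sig & ((1 << shift) - 1) else 0
    return kept, sticky

# 24-bit significand of 0x3BFFFFFF shifted right 8 to align with 1.0:
kept, sticky = align(0xFFFFFF, 8)
print(hex(kept), sticky)   # 0xffff 1 -- eight ones lost, sticky set
```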

With subtraction, though, you can look at it as a borrow, or...

0x3BFFFF80

  1.000000000000000...00 0000...
- 0.000000011111111...11 1000...
==============================
  1.111111100000000...01

is really in logic

                                1
  1.000000000000000...00 00000000
+ 1.111111100000000...00 01111111
====================================
 10.111111100000000...00 10000000
hardware gives
  1.111111100000000...01

which I am still wrapping my head around, because I tried both round-to-zero and round-down, so it shouldn't have rounded up; how did that bit get there?

Anyway, I was on the wrong path. With subtraction, those bits do matter, because the carry in can now be non-zero going into that first bit past the mantissa edge. The first operand's zero extension is still expected to be padded with zeros, but for subtraction, if you have a bunch of ones and then add in a carry bit of one, you can push that carry all the way up to the edge of the mantissa.

Okay so I am either being affected by rounding or I have my boundary wrong (represented my second operand incorrectly)

0x3BFFFF80  

  1.000000000000000...00 0000...
- 0.000000011111111...11 1000...
==============================
  1.111111100000000...01

0x3BFFFFC0  

  1.000000000000000...00 0000...
- 0.000000011111111...11 1100...
==============================
  1.111111100000000...00

                                1
  1.000000000000000...00 00000000
+ ?.111111100000000...00 00111111
=================================

                          1111111
  1.000000000000000...00 00000000
+ ?.111111100000000...00 00111111
=================================
  1.111111100000000...00 01000000
hardware gives
  1.111111100000000...00

Either way, bug in the demonstration or not: because of borrowing during subtraction off the end of the mantissa, those bits can affect both the rounding bit and the lsbit of the result (within the mantissa range, before normalization).

So the answer is yes those bits have to be considered. Basically see and vote for Eric's answer.

You should be able to demonstrate this on other not-grossly-broken implementations, perhaps including a software one (a compiler's optimizer, say), if you can get the conversion right from the high-level language to the specific floating-point value in binary.

But also, when thinking about it, on the addition side you can't get any carry-outs down there, so you can't directly change bits in the fraction before normalization; of course, the fraction from the smaller number still directly affects/defines rounding. Subtraction is the key here, as it can affect both the rounding and the carry bit into the pre-normalized fraction.
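That effect is easy to trigger on the subtraction side: pick a second operand whose shifted-out bits land exactly on a rounding tie, then set one more bit at the very bottom. A sketch assuming Python's `struct` and the default round-to-nearest-even mode; the operand values here are ones I chose for this demonstration, not taken from the worksheets above:

```python
import struct

def f32(w):
    return struct.unpack('>f', struct.pack('>I', w))[0]

def bits(x):
    """Round the double x to single precision and return its bit pattern."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

# 1.0 - f32(0x3BFFFF40) lands exactly halfway between two singles, so
# ties-to-even rounds it up; setting the lowest mantissa bit of the
# subtrahend (0x3BFFFF41) breaks the tie and the result rounds down.
r_tie    = bits(1.0 - f32(0x3BFFFF40))
r_sticky = bits(1.0 - f32(0x3BFFFF41))
print(hex(r_tie), hex(r_sticky))   # two different bit patterns
```

The only difference between the two inputs is a single bit eight places below the bottom of the result's 24-bit frame, yet the two results differ in the last mantissa bit.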

And as that answer comments, yes, in logic there can be shortcuts to avoid needing an adder that large. Also, per that answer: significand or fraction instead of mantissa; sorry, I have been using the old term since back when it wasn't the old term.