floating point addition and subtraction

Question

When performing addition of floating point binary numbers, typically you would change the smaller exponent to match the larger exponent, then adjust the mantissa accordingly. Once the mantissas are aligned they can be added together. The result is then normalised if necessary.

Why do we typically adjust the smaller exponent to match the larger? Why not the other way around? When performing these calculations by hand, the result is the same whichever approach is used.

Eric Postpischil · Accepted Answer · 2018-05-16T10:58:38

When adding numbers with the same sign (or subtracting numbers with opposite signs), the result has either the same exponent as the greater operand or an exponent one higher, depending on whether a carry occurred. So there is less shifting to do if the smaller number is adjusted to match the larger.
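To make the same-sign case concrete, here is a minimal sketch in Python. It assumes a toy format that is not any real standard: a value is `m * 2**e`, where `m` is a normalised integer significand of `P = 8` bits. The name `add_same_sign` and the truncating right shift are illustrative simplifications; real hardware keeps extra guard bits and rounds.

```python
P = 8  # significand width in bits (an arbitrary choice for this toy format)

def add_same_sign(m_a, e_a, m_b, e_b):
    """Add two positive toy floats, each with value m * 2**e.

    Both significands are assumed normalised: 2**(P-1) <= m < 2**P.
    Returns (significand, exponent, renormalising_shifts).
    """
    # Make (m_a, e_a) the operand with the larger exponent.
    if e_a < e_b:
        m_a, e_a, m_b, e_b = m_b, e_b, m_a, e_a

    # Align: shift the smaller operand right by the exponent difference.
    # Bits shifted out are simply dropped here (no rounding).
    m_b >>= e_a - e_b

    total = m_a + m_b
    shifts = 0
    if total >> P:      # the sum overflowed P bits, so a carry occurred
        total >>= 1     # a single right shift renormalises it
        e_a += 1
        shifts = 1
    return total, e_a, shifts

print(add_same_sign(0b10000000, 0, 0b10000000, -3))  # no carry
print(add_same_sign(0b11000000, 0, 0b11000000, 0))   # carry
```

In this sketch the result exponent is always `e_a` or `e_a + 1`, so renormalising after the add takes at most one shift.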

With subtraction of numbers with the same sign (or addition of numbers with opposite signs), cancellation can leave the leading digit in a variety of positions, so there may be less difference between the choices. However, if the smaller number is adjusted to match the larger, shifting is only ever needed in one direction. If the larger is adjusted, there is an additional decision to make about which direction to shift.
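The cancellation case can be sketched the same way. This again assumes a toy format of `m * 2**e` with a normalised 8-bit integer significand, and the name `subtract_same_sign` is hypothetical. For brevity the caller must pass the larger operand first, and bits lost during alignment are dropped, whereas real hardware would keep guard bits.

```python
P = 8  # significand width in bits (an arbitrary choice for this toy format)

def subtract_same_sign(m_a, e_a, m_b, e_b):
    """Compute a - b for toy floats with value m * 2**e.

    Assumes a >= b > 0, e_a >= e_b, and normalised significands
    (2**(P-1) <= m < 2**P).
    Returns (significand, exponent, left_shifts).
    """
    # Align the smaller operand to the larger one's exponent.
    m_b >>= e_a - e_b

    diff = m_a - m_b
    if diff == 0:
        return 0, 0, 0  # exact cancellation; zero is handled specially

    # Cancellation may have cleared any number of leading bits, so the
    # leading 1 can sit anywhere at or below the top position.
    left_shifts = P - diff.bit_length()
    return diff << left_shifts, e_a - left_shifts, left_shifts

print(subtract_same_sign(0b11000000, 2, 0b10000000, 0))  # little cancellation
print(subtract_same_sign(0b10000001, 0, 0b10000000, 0))  # heavy cancellation
```

In this sketch the difference never exceeds `P` bits, so the normalising shift is always to the left; only its length varies, from 0 up to `P - 1`.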
