
I want to ask a question about IEEE 754 floating-point operations:

(Take an IEEE 754 single-precision floating-point number as an example: 1 sign bit, 8 exponent bits, 23 mantissa bits.)

When adding or subtracting two floating-point numbers, the mantissa of the number with the smaller exponent must be aligned to that of the number with the larger exponent.

That is, the exponent difference between the two floating-point numbers determines how far the mantissa shifts.

Here is my problem: after the shift, the mantissa of the number with the smaller exponent may extend beyond the range that a mantissa can express.

Should we include the bits that exceed the range in the calculation, or do we have to discard them?

For example, I want to calculate the subtraction of two floating-point numbers.

The first operand: 0 (sign) 10010011 (exponent) 0000 0000 0000 0000 1111 111 (mantissa)

The second operand: 1 (sign) 10001110 (exponent) 0000 0000 0000 0111 1111 111 (mantissa)

The exponent of the first number is 147 in decimal and the exponent of the second number is 142, so 147 - 127 (bias) = 20 and 142 - 127 = 15.

So in fact the above two numbers can become:

The first operand: 1.0000 0000 0000 0000 1111 111 * 2^20

The second operand: -1.0000 0000 0000 0111 1111 111 * 2^15

Because the second number's exponent is five less than the first's, its mantissa needs to be shifted 5 bits to the right. Then my question is, what should it become:

  1. All bits are retained, so a total of 28 bits is needed to represent the mantissa: -0.0000 1000 0000 0000 0011 111 "1 1111" (these five bits exceed 23 bits) * 2^20

  2. The bits beyond 23 are cut off directly, so the mantissa stays within 23 bits: -0.0000 1000 0000 0000 0011 111 * 2^20

  3. Guard, round, and sticky bits are added for rounding, so 25 bits represent the mantissa: -0.0000 1000 0000 0000 0011 111 11, where the last two bits (bits 24 and 25) are the guard and round bits, and S = 1 (because the last three 1s were cut off).

Which one of the above options is right, or is none of them right?
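One way to check which option matches real hardware is to build the two operands from their bit patterns and compare the machine result against the exact real-number difference. This is a sketch, assuming a correctly rounded IEEE 754 implementation and Python's `struct` module; the helper names are mine:

```python
import struct
from fractions import Fraction

def f32_from_bits(w):
    """Reinterpret a 32-bit integer as an IEEE 754 single."""
    return struct.unpack('>f', struct.pack('>I', w))[0]

def round_to_f32(x):
    """Round a Python float (a double) to single precision."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

# the two operands from the question
a = f32_from_bits(0b0_10010011_00000000000000001111111)  #  1.00000000000000001111111 * 2^20
b = f32_from_bits(0b1_10001110_00000000000001111111111)  # -1.00000000000001111111111 * 2^15

# a + b needs at most 28 significant bits, so computing it in
# double precision gives the exact real-number result here
exact = Fraction(a) + Fraction(b)
print(float(exact))         # 1015819.87890625 (exact real-number result)
print(round_to_f32(a + b))  # 1015819.875 (rounded to single precision)
```

The machine result equals the exact result rounded to single precision, which is what the guard/round/sticky scheme in option 3 is designed to reproduce.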

while specific implementations are an interpretation of the spec, you can still try this with specific implementations, should be somewhat simple, and see what the results are (need to subtract not add). – old_timer

@old_timer Sorry, I don't quite understand what you mean; can you please make your comment clearer? Thank you. – shan w

2 Answers

2
votes

Per IEEE 754, all bits are always considered. The result produced by the operation is the same as if you computed the full result with real-number arithmetic and then rounded that to fit into the floating-point format using whichever rounding rule is in effect. (Round to nearest, ties to even low bit/digit, is common, but there are other options for rounding, such as always upward, always downward, toward zero, and always round any non-zero amount toward an odd low bit.)

This does not mean the computer always has to compute the full real number result. For addition and subtraction, using round, guard, and sticky bits suffices to get the required answer. For other operations, more complicated algorithms may be needed. The requirement is merely that the computer has to figure out what you would get if you computed the full real-number result and rounded it—it does not actually have to figure out the full real-number result.
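This can be checked directly: compute the exact difference with rationals, round it to a 24-bit significand by hand, and compare with the machine result. A sketch, assuming Python; `round_exact_to_single` is a hypothetical helper that only handles non-zero values in the normal range:

```python
import struct
from fractions import Fraction

def round_exact_to_single(x: Fraction) -> Fraction:
    """Round an exact (non-zero, normal-range) real value to the nearest
    single-precision value, ties to even, by scaling so that the 24-bit
    significand becomes an integer."""
    e = abs(x).numerator.bit_length() - abs(x).denominator.bit_length()
    if Fraction(2) ** e > abs(x):
        e -= 1                      # now 2^e <= |x| < 2^(e+1)
    ulp = Fraction(2) ** (e - 23)
    return round(x / ulp) * ulp     # Python's round() is ties-to-even

def f32(w):
    return struct.unpack('>f', struct.pack('>I', w))[0]

# 1.0 minus a value whose significand must shift right by 8:
a, b = 1.0, f32(0x3BFFFFFF)
exact = Fraction(a) - Fraction(b)
machine = struct.unpack('>f', struct.pack('>f', a - b))[0]  # double diff is exact here
print(Fraction(machine) == round_exact_to_single(exact))    # True
```

The machine never saw the "full" 31-bit intermediate as a storable significand, yet its answer is exactly the rounded exact result.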

(“Significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Mantissas are logarithmic; adding to a mantissa multiplies the number represented. Significands are linear; adding to a significand adds to the number represented.)

0
votes

Where I was headed with my (deleted) comment was in the wrong direction.

Now, while each implementation is subject to possible incorrect interpretation of the spec and to bugs (and historically there have been lots of floating-point implementation bugs, not just Intel's famous one), we can try to examine one implementation (my computer).

Start with one operand 1.0

0x3F800000

0 01111111 000...
1.000... (no shift)

Then choose an operand that is going to have to have its mantissa shifted in order to perform addition and subtraction.

0x3BFFFFFF

0 01110111 11111111111111111111111

That is going to shift right by 8. So, looking at some second operands:
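The shift amount comes straight out of the exponent fields. A quick sketch to decode them (the helper name `fields` is mine):

```python
def fields(w):
    """Split a 32-bit single-precision pattern into sign,
    unbiased exponent, and 23-bit mantissa field."""
    sign = w >> 31
    exp = ((w >> 23) & 0xFF) - 127   # remove the bias of 127
    mantissa = w & 0x7FFFFF
    return sign, exp, mantissa

print(fields(0x3F800000))  # (0, 0, 0)         exponent 0
print(fields(0x3BFFFFFF))  # (0, -8, 8388607)  exponent -8 -> shift right 8
```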

0x3BFFFFFF

  1.000000000000000...00 0000...
+ 0.000000011111111...11 1111...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 1111...
==============================
  1.111111100000000...00

0x3BFFFF00  

  1.000000000000000...00 0000...
+ 0.000000011111111...11 0000...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 0000...
==============================
  1.111111100000000...10

0x3BFFFF80  

  1.000000000000000...00 0000...
+ 0.000000011111111...11 1000...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 1000...
==============================
  1.111111100000000...01

0x3BFFFFC0  

  1.000000000000000...00 0000...
+ 0.000000011111111...11 1100...
==============================
  1.000000011111111...11

  1.000000000000000...00 0000...
- 0.000000011111111...11 1100...
==============================
  1.111111100000000...00

0x3BFFFF01  

  1.000000000000000...00 00000000
+ 0.000000011111111...11 00000001
=================================
  1.000000011111111...11

  1.000000000000000...00 00000000
- 0.000000011111111...11 00000000
=================================
  1.111111100000000...00 00000001
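The experiments above can be reproduced in a few lines; a sketch assuming Python's `struct`, using the same hex operands. Note that the worksheets above show pre-rounding alignment, while this program prints the final rounded bit pattern of each sum and difference:

```python
import struct

def f32(w):
    """Reinterpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack('>f', struct.pack('>I', w))[0]

def bits(x):
    """Round the double x to single precision and return its bit pattern."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

one = f32(0x3F800000)
for w in (0x3BFFFFFF, 0x3BFFFF00, 0x3BFFFF80, 0x3BFFFFC0, 0x3BFFFF01):
    print(f"{w:#010x}: 1+x -> {bits(one + f32(w)):#010x}, "
          f"1-x -> {bits(one - f32(w)):#010x}")
```

Because each operand has at most 32 significant bits here, the double-precision sum/difference is exact, and packing it back to single performs the single rounding step the hardware would.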

For addition (without rounding), the base number, the number not shifted, needs to be padded with zeros two bits past the end of the mantissa.

0+0 = 0 carry 0
0+1 = 1 carry 0

You cannot have a carry in to the first bit past the mantissa (the sticky bit), so for addition there is no reason for additional logic past that first bit; you do need that first bit for rounding, though. It only takes one bit to round.
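During alignment, everything shifted out can be collapsed into a single sticky bit. A minimal sketch (the helper `align` is mine, assuming Python):

```python
def align(sig, shift):
    """Shift a significand right by `shift`, ORing everything
    shifted out into a single sticky bit."""
    kept = sig >> shift
    sticky = 1 if sig & ((1 << shift) - 1) else 0
    return kept, sticky

# 24-bit significand of 0x3BFFFFFF shifted right 8 to align with 1.0:
kept, sticky = align(0xFFFFFF, 8)
print(hex(kept), sticky)   # 0xffff 1 -- eight ones lost, sticky set
```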

With subtraction, though, you can look at it as a borrow, or...

0x3BFFFF80

  1.000000000000000...00 0000...
- 0.000000011111111...11 1000...
==============================
  1.111111100000000...01

is really in logic

                                1
  1.000000000000000...00 00000000
+ 1.111111100000000...00 01111111
====================================
 10.111111100000000...00 10000000
hardware gives
  1.111111100000000...01

which I am still wrapping my head around, because I tried both round-to-zero and round-down, so it shouldn't have rounded up; how did that bit get there?

Anyway, I was on the wrong path. With subtraction, those bits do matter, because the carry in can now be non-zero going into that first bit past the mantissa edge. The first operand's zero extension is still expected to be padded with zeros, but for subtraction, if you have a bunch of ones and then add in a carry bit of one, you can push that carry all the way up to the edge of the mantissa.

Okay so I am either being affected by rounding or I have my boundary wrong (represented my second operand incorrectly)

0x3BFFFF80  

  1.000000000000000...00 0000...
- 0.000000011111111...11 1000...
==============================
  1.111111100000000...01

0x3BFFFFC0  

  1.000000000000000...00 0000...
- 0.000000011111111...11 1100...
==============================
  1.111111100000000...00

                                1
  1.000000000000000...00 00000000
+ ?.111111100000000...00 00111111
=================================

                          1111111
  1.000000000000000...00 00000000
+ ?.111111100000000...00 00111111
=================================
  1.111111100000000...00 01000000
hardware gives
  1.111111100000000...00

Either way, bug in the demonstration or not: because of borrowing during subtraction off the end of the mantissa, those bits can affect both the rounding bit and the lsbit of the result (within the mantissa range, before normalization).

So the answer is yes those bits have to be considered. Basically see and vote for Eric's answer.

You should be able to demonstrate this on other not-grossly-broken implementations, perhaps including a software one (a compiler's optimizer, say), if you can get the conversion right from the high-level language to the specific floating-point value in binary.

But also, when thinking about it, on the addition side you can't get any carry-outs down there, so you can't directly change bits in the fraction before normalization; of course, the fraction from the smaller number still directly affects/defines rounding. Subtraction is the key here, as it can affect both the rounding and the carry bit into the pre-normalized fraction.
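That effect is easy to trigger on the subtraction side: pick a second operand whose shifted-out bits land exactly on a rounding tie, then set one more bit at the very bottom. A sketch assuming Python's `struct` and the default round-to-nearest-even mode; the operand values here are ones I chose for this demonstration, not taken from the worksheets above:

```python
import struct

def f32(w):
    return struct.unpack('>f', struct.pack('>I', w))[0]

def bits(x):
    """Round the double x to single precision and return its bit pattern."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

# 1.0 - f32(0x3BFFFF40) lands exactly halfway between two singles, so
# ties-to-even rounds it up; setting the lowest mantissa bit of the
# subtrahend (0x3BFFFF41) breaks the tie and the result rounds down.
r_tie    = bits(1.0 - f32(0x3BFFFF40))
r_sticky = bits(1.0 - f32(0x3BFFFF41))
print(hex(r_tie), hex(r_sticky))   # two different bit patterns
```

The only difference between the two inputs is a single bit eight places below the bottom of the result's 24-bit frame, yet the two results differ in the last mantissa bit.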

And as that answer comments, yes, in logic there can be shortcuts to avoid needing an adder that large. Also, per that answer: significand or fraction instead of mantissa; sorry, I have been using the old term since back when it wasn't the old term.