IEEE-754 Floating Point Exponent Alignment Issue

Question

I'm making a floating point calculator from the ground up basically, and I'm having an issue with the part where you align the exponents of two numbers in the case that they are not equal.

For instance: 72.5 + 12.25 = 84.75

But my program is instead returning 106.5

Here is the code for the function that aligns the exponents:

void align(MyStruct* a, MyStruct* b)
{
   if (a->exponent > b->exponent)
   {
      b->exponent = a->exponent; // Sets the exponent of b = to a 
      b->fraction >>= a->exponent - b->exponent; // Shifts the mantissa (fraction) bits of b to the right
   }
   return;
}

I don't know what I'm doing wrong here. The binary representation for the example equation above is as shown:

0|10000101|00100010000000000000000 A

0|10000010|10001000000000000000000 B +

When I do b->exponent = a->exponent;, I'm expecting it to make b

0|10000101|10001000000000000000000, which goes smoothly. Then I expect the mantissa portion of b to be shifted right as many times as necessary to make up for the increase in the exponent (in this case, 3 times). This also happens without issue, leaving b to become 0|10000101|00010001000000000000000

Up to this point, I would expect to get the correct result. However, it does not produce the correct number. Looking into it further with other floating point calculators online, it appears that the result of a + b is represented as 0|10000101|01010011000000000000000 in binary.

However, when adding my two modified mantissas together, that is not the result I get. What am I doing wrong here? The only thing I suspect is that the hidden bit (the 1) is not being shifted during the process. Is this the case?

I should mention that my structs are composed of three integer variables, each of which represents one part of the IEEE-754 floating point format (sign, exponent, fraction/mantissa). So the mantissa for A, for example, would be 00000000000100010000000000000000 (32 bits instead of 23, but when they're all parsed it becomes the full representation of the float). Also, I am pretty positive that my other functions are working as intended, and that align is the issue here.
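For reference, here is a minimal sketch of how such a three-field struct could be filled from a real float. The names `Fields` and `unpack` are hypothetical stand-ins (the question does not show the definition of `MyStruct`), and the sketch assumes a 32-bit IEEE-754 `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the question's MyStruct: three integer fields. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* low 23 bits; the implicit leading 1 is NOT stored */
} Fields;

/* Split a float's bit pattern into sign, exponent and fraction. */
Fields unpack(float f)
{
    uint32_t bits;
    Fields out;

    /* Assumption: float is a 32-bit IEEE-754 single on this platform. */
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to read the bits */

    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

With this sketch, `unpack(72.5f)` should give exponent 10000101 and fraction 00100010000000000000000, and `unpack(12.25f)` should give exponent 10000010 and fraction 10001000000000000000000, matching A and B above.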

Any advice?

Are you remembering to put in the implicit 1 before the binary point? That should be done before shifting the significand. — Patricia Shanahan
@PatriciaShanahan My understanding was that this bit was not actually there, no? It was just understood that it was there--which would be more efficient seeing as it would allow for more precision. — EthanR
Well, yes, it's implicit, but it should still participate in the shift. When you shift 1.1 right by one bit, it should become 0.11 but you are turning it into 1.01 — Igor Tandetnik
So this bit wouldn't be included in the fraction though, I would just simulate its existence, correct? By that I mean my variable which holds the fraction, int fraction for example, wouldn't be 000000001|10000000000000000000000, but rather 000000000|10000000000000000000000, and I would just simulate the bit past the 23rd place? — EthanR
Personally, I'd probably extract the pieces out of IEEE representation into local variables with more bits to spare (materializing the implicit bit at this point), do the math, and then pack the result back into the representation (which involves normalizing it and removing the high bit, so it's implicit again). It doesn't make sense to me to try and stay within IEEE the whole time, since you need denormal numbers as intermediate results, and IEEE representation is not designed for that. — Igor Tandetnik
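The approach described in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the function name `add_positive` is hypothetical, both inputs are assumed to be positive normal numbers, the result is truncated rather than rounded, and zero, subnormals, infinities, NaN and exponent overflow are not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: add two positive, normal floats by hand.
   Truncates instead of rounding; no special-case handling. */
float add_positive(float x, float y)
{
    uint32_t xb, yb, rb;
    float r;

    /* Assumption: float is a 32-bit IEEE-754 single on this platform. */
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&xb, &x, sizeof xb);
    memcpy(&yb, &y, sizeof yb);

    /* Unpack, materializing the implicit 1 as bit 23 of the significand. */
    uint32_t xe = (xb >> 23) & 0xFFu, xs = (xb & 0x7FFFFFu) | (1u << 23);
    uint32_t ye = (yb >> 23) & 0xFFu, ys = (yb & 0x7FFFFFu) | (1u << 23);

    /* Align: shift the significand of the smaller-exponent operand right. */
    uint32_t e = xe;
    if (xe > ye) {
        uint32_t d = xe - ye;
        ys = d < 32 ? ys >> d : 0;   /* shifting by >= 32 is undefined in C */
    } else if (ye > xe) {
        uint32_t d = ye - xe;
        xs = d < 32 ? xs >> d : 0;
        e = ye;
    }

    /* Add, then renormalize if the sum carried into bit 24. */
    uint32_t sum = xs + ys;
    if (sum & (1u << 24)) {
        sum >>= 1;
        e += 1;
    }

    /* Pack: mask off the leading 1 so it becomes implicit again. */
    rb = (e << 23) | (sum & 0x7FFFFFu);
    memcpy(&r, &rb, sizeof r);
    return r;
}
```

For the numbers in the question, `add_positive(72.5f, 12.25f)` works on the significands 0x910000 and 0xC40000 >> 3 = 0x188000, whose sum 0xA98000 packs to the fraction 01010011000000000000000, i.e. 84.75. Keeping the implicit bit in a wider local variable is what lets it take part in the shift.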

EthanR · Accepted Answer · 2020-07-26T12:00:05

The calculation would have been wrong even with the main issue fixed, because I was shifting by the difference between the exponents after I had already set the exponents equal to one another, which means I was shifting 0 times. So that was a silly oversight by me.
The actual issue was resolved by setting the 24th bit in the mantissa being shifted. The bit technically doesn't exist, but as someone pointed out, it is implied to be there and will be moved over when the shifting occurs.

The fixed code would be as follows:

void align(MyStruct* a, MyStruct* b)
{
    if (a->exponent != b->exponent) // If the exponents are not equal
    {
        if (a->exponent > b->exponent)
        {
            int disp = a->exponent - b->exponent; // number of shifts needed, computed before the exponents are made equal
            b->fraction |= 1 << 23; // sets the implicit bit of b, the operand with the smaller exponent
            b->exponent = a->exponent; // sets exponents equal to each other
            b->fraction >>= disp; // b's mantissa is shifted right to make up for the increase in its exponent
            return;
        }
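        int disp = b->exponent - a->exponent; // here b has the larger exponent, so a is the one shifted
        a->fraction |= 1 << 23; // sets the implicit bit of a
        a->exponent = b->exponent; // sets exponents equal to each other
        a->fraction >>= disp; // a's mantissa is shifted right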
        return;
    }
    return;
}

Thanks to those that helped!
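As a sanity check on the two mistakes described above, the following sketch replays the example from the question (A = 72.5, B = 12.25) with and without the shift and the implicit bit. The helper names `field_value` and `aligned_sum` are hypothetical, and the sketch only covers this example: it assumes the summed fraction does not carry past bit 22, which holds for these numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction and biased-exponent fields of A and B from the question. */
enum { FRAC_A = 0x110000, FRAC_B = 0x440000, EXP_A = 133, EXP_B = 130 };

/* Interpret a 23-bit fraction field and a biased exponent as a value,
   assuming a normal number (implicit leading 1). */
double field_value(uint32_t fraction, int exponent)
{
    double value = 1.0 + (double)fraction / 8388608.0;  /* 8388608 = 2^23 */
    for (int e = exponent - 127; e > 0; --e) value *= 2.0;
    for (int e = exponent - 127; e < 0; ++e) value /= 2.0;
    return value;
}

/* Fraction field of A + B after moving B to A's exponent.
   shift:  how far B's fraction is shifted right
   hidden: nonzero to set B's implicit bit (bit 23) before shifting */
uint32_t aligned_sum(int shift, int hidden)
{
    uint32_t b = FRAC_B;

    assert(shift >= 0 && shift < 32);
    if (hidden)
        b |= 1u << 23;
    b >>= shift;
    return FRAC_A + b;   /* A's own leading 1 stays implicit here */
}
```

Evaluated at A's exponent, `aligned_sum(0, 0)` (no shift, no implicit bit) comes out as 106.5, the wrong result reported in the question; `aligned_sum(3, 0)` (shift only) gives 76.75; and `aligned_sum(3, 1)` (shift with the implicit bit set) gives the expected 84.75.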
