3
votes

I have a question about the pack754() function defined in Section 7.4 of Beej's Guide to Network Programming.

This function converts a floating point number f into its IEEE 754 representation where bits is the total number of bits to represent the number and expbits is the number of bits used to represent only the exponent.

I am concerned with single-precision floating numbers only, so for this question, bits is specified as 32 and expbits is specified as 8. This implies that 23 bits is used to store the significand (because one bit is sign bit).

My question is about this line of code.

    significand = fnorm * ((1LL<<significandbits) + 0.5f);

What is the role of + 0.5f in this code?

Here is a complete code that uses this function.

#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros

uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
    long double fnorm;
    int shift;
    long long sign, exp, significand;
    unsigned significandbits = bits - expbits - 1; // -1 for sign bit

    if (f == 0.0) return 0; // get this special case out of the way

    // check sign and begin normalization
    if (f < 0) { sign = 1; fnorm = -f; }
    else { sign = 0; fnorm = f; }

    // get the normalized form of f and track the exponent
    shift = 0;
    while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
    while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
    fnorm = fnorm - 1.0;

    // calculate the binary form (non-float) of the significand data
    significand = fnorm * ((1LL<<significandbits) + 0.5f);

    // get the biased exponent
    exp = shift + ((1<<(expbits-1)) - 1); // shift + bias

    // return the final answer
    return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}

int main(void)
{
    float f = 3.1415926;
    uint32_t fi;

    printf("float f: %.7f\n", f);

    fi = pack754(f, 32, 8);
    printf("float encoded: 0x%08" PRIx32 "\n", fi);

    return 0;
}

What purpose does + 0.5f serve in this code?

2
Something tells me that you didn't copy that line correctly. In particular, I think you added an extra set of parentheses.user3386109
@user3386109 The referenced link has the same comment and code significand = fnorm * ((1LL<<significandbits) + 0.5f); and same types. Agree if would work better as significand = fnorm * (1LL<<significandbits) + 0.5f;.chux - Reinstate Monica

2 Answers

3
votes

The code is an incorrect attempt at rounding.

long double fnorm;
long long significand;
unsigned significandbits
...
significand = fnorm * ((1LL<<significandbits) + 0.5f);  // bad code

The first clue of incorrectness is the f of 0.5f, which indicates float, is a nonsensical introduction of specifying float in a routine with long double f and fnorm. float math has no application in the function.

Yet adding 0.5f does not mean that the code is limited to float math in (1LL<<significandbits) + 0.5f. See FLT_EVAL_METHOD which may allow higher precision intermediate results and have fooled the code author in testing.

A rounding attempt does make sense as the argument is long double and the target representations are narrower. Adding 0.5 is a common approach - but it is not done right here. IMO, the lack of the author commenting here concerning 0.5f hinted that the intent was "obvious" - not subtle, albeit incorrect.

As commented, moving the 0.5 is closer to being correct for rounding, but may mis-lead some into thinking the addition is done with float math, (it is long double math adding a long doubleproduct to float causes the 0.5f to be promoted to long double first).

// closer to rounding but may mislead
significand = fnorm * (1LL<<significandbits) + 0.5f;

// better
significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5

To round, without calling the preferred <math.h> rounds routines like rintl(), roundl(), nearbyintl(), llrintl(), adding the explicit type 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly with many cases. The +0.5 trick relies on that sum being exact.

Consider

long double product = fnorm * (1LL<<significandbits);
long long significand = product + 0.5;  // double rounding?

product + 0.5 itself may go through a rounding before truncation/assignment to long long - in effect double rounding.

Best to use the right tool in the C shed of standard library functions.

significand = llrintl(fnorm * (1ULL<<significandbits));

A corner case remains with this rounding is where significand is now one too great and significand , exp needs adjustment. As well identified by @Nayuki, code has other short-comings too. Also, it fails on -0.0.

2
votes

The + 0.5f serves no purpose in the code, and may be harmful or misleading.

The expression (1LL<<significandbits) + 0.5f results in a float. But even for the small case of significandbits = 23 for single-precision floating-point, the expression evaluates to (float)(223 + 0.5), which rounds to exactly 223 (round half even).

Replacing + 0.5f with + 0.0f results in the same behavior. Heck, drop that term entirely, because fnorm will cause the right-hand side argument of * to be casted to long double anyway. This would be a better way to rewrite the line: long long significand = fnorm * (long double)(1LL << significandbits);


Side note: This implementation of pack754() handles zero correctly (and collapses negative zero to positive zero), but mishandles subnormal numbers (wrong bits), infinities (infinite loop), and NaN (wrong bits). It's best to not treat it as a reference model function.