1 vote

I have a program which uses fixed point numbers, because the CPU I'm using doesn't support IEEE754 floats.

I've been doing fine with converting standard IEEE754 floats into fixed point by finding the exponent, then shifting the number accordingly, all by manually accessing the bits of the IEEE754 float in memory. After conversion, I'm able to do fixed-point calculations just fine.
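Conceptually, the conversion I already have looks something like this (a simplified sketch with made-up names, not my exact code; it truncates and doesn't handle NaN/Inf/denormals or values that overflow Q15.16):

#include <stdint.h>

/* Simplified sketch: raw IEEE754 single-precision bits -> Q15.16 */
static int32_t fix16_from_float_bits(uint32_t f)
{
    uint32_t sign     = f >> 31;
    int32_t  exponent = (int32_t)((f >> 23) & 0xFF) - 127;  /* remove bias */
    uint32_t mantissa = f & 0x007FFFFF;

    if ((f & 0x7FFFFFFF) == 0)   /* +/- 0.0 */
        return 0;

    mantissa |= 0x00800000;      /* restore the implicit leading 1 */

    /* mantissa is 1.xxx scaled by 2^23; Q15.16 wants a scale of 2^16,
       so the net shift is exponent - 23 + 16 = exponent - 7 */
    int32_t shift = exponent - 7;
    if (shift <= -24)            /* too small, underflows to 0 in Q15.16 */
        return 0;

    uint32_t magnitude = (shift >= 0) ? (mantissa << shift)
                                      : (mantissa >> -shift);
    return sign ? -(int32_t)magnitude : (int32_t)magnitude;
}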

However, is it possible to convert a fixed-point number (say a Q15.16 integer) back to an IEEE754 floating point without FPO, so that CPUs with IEEE754/FPO support could read it as their native float type? Is there any code or example of how a CPU's FPO unit actually does this conversion with raw byte manipulation, or is it some black magic that can't be done in software? Obviously, I'm not looking for a super-precise conversion.

All the answers I've seen so far use FPO, for example by first calculating 2^(-num_fraction_bits_in_fixed), which already needs FPO, and then scaling the fixed-point value by that factor.
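In other words, this is the kind of thing I'm trying to avoid (an illustrative example, not code from any particular answer):

/* Uses FPO twice: the int-to-float cast and the multiplication. */
float float_from_fix16_with_fpo(int32_t fixed)
{
    return (float)fixed * (1.0f / 65536.0f);   /* 65536 = 2^16 fraction bits */
}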

Edit: Using EOF's answer as a baseline, I was able to create the following code snippet for reconstructing an IEEE754 float from a fixed-point integer (in this example the fixed point is a Q31.32, stored inside an INT64). I also handled the case of 0 manually, since without it the code would return a really small, but still non-zero, value.

Here's the code:

static INT32 count_exponent(UINT64 x)
{
    // Returns the unbiased base-2 exponent of a Q31.32 value, i.e.
    // (index of the highest set bit) - 32.
    INT32 l = -33;
    for (UINT64 i = 0; i < 64; i++)
    {
        UINT64 test = 1ULL << i;
        if (x >= test)
            l++;
        else
            break;
    }
    return l;
}

UINT32 float_from_fix32(INT64 value)
{
    // special case of 0, handled first so the shifts below never see 0
    if (value == 0)
        return 0;

    UINT64 sign = 0;
    if (value < 0)
        sign = 1;

    // work on the absolute value (INT64_MIN is not handled here)
    UINT64 unsigned_ver = value < 0 ? (UINT64)-value : (UINT64)value;

    // calculate the mantissa: drop the leading 1 bit, left-align the rest,
    // then keep the top 23 bits. nlz() counts leading zero bits.
    int lz = nlz(unsigned_ver);
    UINT64 y = (lz + 1 < 64) ? (unsigned_ver << (lz + 1)) : 0;

    // Our fixed point is 64 bits wide and an IEEE754 single keeps 23
    // mantissa bits, so shift right by 64 - 23 = 41 (= 33 + 8)
    UINT64 mantissa = y >> (33 + 8);

    // unbiased exponent of the Q31.32 value, plus the IEEE754 bias (127)
    UINT64 exp = count_exponent(unsigned_ver) + 127;

    // construct the final IEEE754 float binary number
    // first add the last 23 bits (mantissa)
    UINT32 ret = mantissa;

    // add exponent
    ret |= (exp << 23);

    // add the sign if needed
    if (sign)
        ret |= 0x80000000;

    return ret;
}
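
nlz() is just a count-leading-zeros helper and isn't shown above (declare it before float_from_fix32). A portable fallback, plus a small usage example for machines with native IEEE754 support, could look roughly like this:

static int nlz(UINT64 x)
{
    // Count leading zero bits; many compilers have a builtin that is faster.
    int n = 0;
    if (x == 0)
        return 64;
    while ((x & 0x8000000000000000ULL) == 0)
    {
        n++;
        x <<= 1;
    }
    return n;
}

// Reading the result back as a native float (requires <string.h>):
//   UINT32 bits = float_from_fix32(fixed_value);
//   float f;
//   memcpy(&f, &bits, sizeof f);

memcpy is used instead of a pointer cast to avoid strict-aliasing problems.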
I actually need to do the opposite of this! So this topic was super helpful. - proteneer
@proteneer I have ported libfixmath to 64-bit machines. If you have a 64-bit machine without FPO, check out the library: github.com/jussihi/libfixmath64 - Jussi Hietanen

1 Answer

1 vote

Without loss of generality, consider an unsigned fixed-point number x, assuming (a loss of generality here) that every number in your fixed-point format is representable by a normalized float of the floating-point format:

1) Find the number of leading zeros n (there may be special CPU instructions to do this quickly and without a (software) loop).

2) Shift the number left (y = x << (n+1)), dropping the leading 1 bit and left-aligning the rest, then shift right (m = y >> (signbit + exponentbits)); this is the mantissa of the float.

3) Take the number of non-fractional bits of the fixed-point format and subtract (n + 1) from it; this is the unbiased exponent. Add the exponent bias of the floating-point format, then shift the biased exponent to the exponent bit-position of the floating-point result.

4) If the original number was not unsigned, set the sign-bit in the result iff the number was negative.
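
Putting 1) through 4) together for an unsigned 32-bit fixed-point number with 16 fraction bits, a minimal sketch in C (it uses <stdint.h> types, truncates the mantissa instead of rounding, and assumes a non-zero input that is representable as a normalized float, per the assumption above):

#include <stdint.h>

#define FRAC_BITS     16                  /* fractional bits of the fixed-point format */
#define INT_BITS      (32 - FRAC_BITS)    /* non-fractional bits */
#define SIGN_BITS     1                   /* IEEE754 single */
#define EXPONENT_BITS 8
#define EXPONENT_BIAS 127

/* x is unsigned, non-zero, and representable as a normalized float */
static uint32_t float_bits_from_ufix(uint32_t x)
{
    /* 1) count leading zeros (a software loop here; many CPUs have an instruction) */
    int n = 0;
    for (uint32_t t = x; (t & 0x80000000u) == 0; t <<= 1)
        n++;

    /* 2) drop the leading 1 and left-align, then keep the top mantissa bits */
    uint32_t y = (n + 1 < 32) ? (x << (n + 1)) : 0;
    uint32_t mantissa = y >> (SIGN_BITS + EXPONENT_BITS);

    /* 3) unbiased exponent = non-fractional bits - (n + 1); add the bias */
    uint32_t exponent = (uint32_t)(INT_BITS - (n + 1) + EXPONENT_BIAS);

    /* 4) for a signed input, the sign bit would be OR'ed into bit 31 here */
    return (exponent << (32 - SIGN_BITS - EXPONENT_BITS)) | mantissa;
}

For example, float_bits_from_ufix(0x00018000) (1.5 in 16.16) returns 0x3FC00000, the bit pattern of 1.5f.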


a) If the fixed-point number v is signed, keep the sign s separately (you can copy it to the sign bit of the floating-point number directly) and feed the absolute value into the above algorithm: x = v < 0 ? -v : v.

b) exponentbits depends on the floating-point format. For IEEE754 32-bit float, it is 8.

c) A fixed-point format typically represents a number by an integer of w bits which is (conceptually) divided by a constant 2^m, so there are m fractional bits. The non-fractional bits (if any) are the top w - m bits, which exist if w > m.

d) The exponent bias is again given by the floating-point format. For IEEE754 32-bit float, the bias is 127.
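
As a concrete check with these constants, converting 3.25 stored in an unsigned 16.16 fixed-point format (x = 0x00034000) to an IEEE754 32-bit float:

n = 14 leading zeros
mantissa: y = x << 15 = 0xA0000000, m = y >> (1 + 8) = 0x00500000
exponent: 16 non-fractional bits, so 16 - (14 + 1) = 1, plus bias 127 gives 128 = 0x80
result: (0x80 << 23) | 0x00500000 = 0x40500000, which is the bit pattern of 3.25f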