3
votes

I'm writing an algorithm to round a floating-point number. The input is a 64-bit IEEE 754 double very close to X.5, where X is an integer less than 32. The first solution that came to mind is to use a bit mask to clear the least significant bits, since they represent very small fractions of the form 2^-n (given that the exponent is not large).

But should I do that? Are there other ways to accomplish the same thing? I feel that using bit operations on floating-point values is controversial. Thanks!

The language I'm using is C++, by the way.

Edit: Thanks for your comments, I appreciate them! Say I have a floating-point number such as 1.4999999... or 21.50000012...; I want to round it to 1.5 or 21.5. My goal is to round any number to the nearest value of the form X.5, since that value can be stored exactly in an IEEE 754 floating-point number.

5
You haven't told us what the problem is. You're just listing your possible solutions for some unknown goal. Did you want us to tell you how to bitwise round a floating-point number? – Lightness Races in Orbit
Exactly what type of rounding are you hoping to do? – NPE
Why don't you use standard library functions? – David Heffernan
@DavidHeffernan That would be best, but I don't know whether any of them fits my purpose. I just updated my question. Could you have a look? Thanks! – Archer
What do you mean by round to X.5? Should 2.1 be rounded to 2.0 or 2.5? – aka.nice

5 Answers

6
votes

If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method described in this blog post: add, and then immediately subtract, a large constant so as to move the value into the binade of floating-point numbers where the ULP is 0.5. You are unlikely to find a faster method, and it does not involve any bit manipulation.

The appropriate constant to round a number between 0 and 32 to the nearest half-unit for IEEE 754 double-precision is 2251799813685248.0 (that is, 2^51).

Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
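
As a minimal sketch of the add-then-subtract trick, the one-liner can be wrapped in a small function. The function name is hypothetical, and the sketch assumes IEEE 754 doubles, the default round-to-nearest mode, inputs between 0 and 32, and a build without value-changing floating-point optimizations (such as -ffast-math), which could otherwise cancel the two operations.

```cpp
#include <limits>

// Assumption: double is an IEEE 754 binary64 type on this platform.
static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch requires IEEE 754 doubles");

// Hypothetical helper: rounds x (assumed 0 <= x < 32) to the nearest multiple of 0.5.
inline double round_to_half_by_constant(double x) {
    const double big = 2251799813685248.0;  // 2^51: doubles in [2^51, 2^52) are spaced 0.5 apart
    volatile double sum = x + big;           // volatile forces the intermediate sum to be stored as a double
    return sum - big;                        // subtracting big is exact and leaves the rounded value
}
```

For the values in the question, round_to_half_by_constant(1.4999999) gives 1.5 and round_to_half_by_constant(21.50000012) gives 21.5. Note that this rounds to the nearest multiple of 0.5, so an input such as 2.1 becomes 2.0 rather than 2.5.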

5
votes

You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). Each rounds in a different mode, and all are standard C99 (C++11 provides them in <cmath>). In C, the only thing you need to do is link against the standard math library, for example by passing -lm on Unix-like toolchains.
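
To make the different modes concrete, here is a small sketch that applies several of these functions to the same input. The struct and function names are hypothetical and exist only for illustration; the std:: calls are the standard <cmath> ones, and the nearbyint result assumes the default rounding mode has not been changed.

```cpp
#include <cmath>

// Hypothetical struct for illustration: one input rounded five different ways.
struct RoundingResults {
    double nearest_away;  // std::round: nearest integer, halfway cases away from zero
    double down;          // std::floor: toward negative infinity
    double up;            // std::ceil: toward positive infinity
    double toward_zero;   // std::trunc: drops the fractional part
    double current_mode;  // std::nearbyint: current rounding mode (to nearest, ties to even, by default)
};

// Hypothetical helper that collects the results for a single value.
inline RoundingResults round_all_ways(double x) {
    return { std::round(x), std::floor(x), std::ceil(x), std::trunc(x), std::nearbyint(x) };
}
```

For x = 2.5 the fields come out as 3, 2, 3, 2, 2; for x = -2.5 they are -3, -3, -2, -2, -2. The halfway cases are where round() and nearbyint() differ.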

As for achieving rounding by bit manipulation, I would stay away from it: a) it will likely be much slower than the functions above (they generally use hardware facilities where possible), b) it reinvents the wheel, with a lot of potential for bugs, and c) the newer C standards don't like bit manipulation on floating-point types: the so-called strict aliasing rules forbid simply casting a double* to a uint64_t*. You would either need to cast to an unsigned char* and manipulate the IEEE number byte by byte, or use memcpy() to copy the bit representation from a double variable into a uint64_t and back again. That is a lot of hassle for something already available in the form of standardized functions with hardware support.
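
If you do need the raw bits, the memcpy() route can be sketched as follows. The two helper names are hypothetical, and the sketch assumes that double and std::uint64_t have the same size, which the static_assert checks.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies the bit pattern of a double into an integer
// without violating the strict aliasing rules.
inline std::uint64_t double_to_bits(double d) {
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Hypothetical helper: the inverse copy, from bit pattern back to double.
inline double bits_to_double(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

On an IEEE 754 platform, double_to_bits(1.5) is 0x3FF8000000000000: biased exponent 0x3FF and only the 2^-1 mantissa bit set. Compilers typically turn such a fixed-size memcpy() into a plain register move.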

3
votes

You want to round x to the nearest value of the form d.5. For a general number you write:

round(x+0.5)-0.5

For a number close to d.5, less than 0.25 away, you can use Pascal's offering:

round(2*x)*0.5

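
The two expressions above behave differently away from d.5, which a short sketch makes visible. The function names are hypothetical; std::round is the standard <cmath> function.

```cpp
#include <cmath>

// Hypothetical helper: nearest value of the form d.5 (..., 0.5, 1.5, 2.5, ...).
inline double nearest_d_point_5(double x) {
    return std::round(x + 0.5) - 0.5;
}

// Hypothetical helper: nearest multiple of 0.5 (..., 1.0, 1.5, 2.0, 2.5, ...).
// Agrees with nearest_d_point_5 only when x is within 0.25 of some d.5.
inline double nearest_half_multiple(double x) {
    return std::round(2.0 * x) * 0.5;
}
```

Both functions map 1.4999999 to 1.5 and 21.50000012 to 21.5. For 2.1, the first gives 2.5 while the second gives 2.0, which is the distinction raised in the comments on the question.
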
1
votes

If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):

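#include <cstdint>
#include <cstring>

void RoundNearestHalf(double &d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // copy the bits out; avoids the strict-aliasing cast
    unsigned const      maskshift = static_cast<unsigned>((bits >> 52) & 0x7FF) - 1023;  // unbiased exponent
    std::uint64_t const setmask   = 0x0008000000000000ULL >> maskshift;     // the 2^-1 bit
    std::uint64_t const clearmask = ~(0x0007FFFFFFFFFFFFULL >> maskshift);  // keeps 2^-1 and everything above
    bits |= setmask;
    bits &= clearmask;
    std::memcpy(&d, &bits, sizeof d);     // copy the modified bits back
}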

maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but no more than 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.

Note that it would be worth profiling this against other implementations, perhaps ones using the standard library, to determine whether or not it's actually faster.

0
votes

I can't speak for C++ for sure, but in C99 support for the IEEE 754 floating-point standard is optional rather than required. If the __STDC_IEC_559__ macro is defined, the implementation declares that IEC 559 (which is more or less IEEE 754) is used for floating point.
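
On the C++ side, the standard library exposes a comparable query through std::numeric_limits. A minimal sketch of a compile-time check follows; the function name is hypothetical, and the extra size and precision tests are an assumption about what code relying on the binary64 layout would want to verify.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: true when double claims IEC 559 (IEEE 754) conformance
// and has the 64-bit layout with a 53-bit significand that binary64 implies.
constexpr bool double_is_ieee754_binary64() {
    return std::numeric_limits<double>::is_iec559
        && std::numeric_limits<double>::radix == 2
        && std::numeric_limits<double>::digits == 53
        && sizeof(double) * CHAR_BIT == 64;
}
```

Code that depends on the layout could then guard itself with static_assert(double_is_ieee754_binary64(), "...") so that it fails to compile on a platform where the assumption does not hold.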

I think it should be pointed out that there are functions to handle many types of rounding for you.