5
votes

I know from articles like "Why you should never cast floats to ints" and many others like it that casting a float to a signed int is expensive. I'm also aware that certain conversion instructions or SIMD vector instructions on some architectures can speed the process. I'm curious if converting an integer to floating point is also expensive, as all the material I've found on the subject only talks about how expensive it is to convert from floating point to integer.

Before anyone says "Why don't you just test it?" I'm not talking about performance on a particular architecture, I'm interested in the algorithmic behavior of the conversion across multiple platforms adhering to the IEEE 754-2008 standard. Is there something inherent to the algorithm for conversion that affects performance in general?

Intuitively, I would think that conversion from integer to floating point would be easier in general for the following reasons:

  • Rounding is only necessary if the precision of the integer exceeds the precision of the binary floating point number, e.g. 32-bit integer to 32-bit float might require rounding, but 32-bit integer to 64-bit float won't, and neither will a 32-bit integer that only uses 24-bits of precision.

  • There is no need to check for NAN or +/- INF or +/- 0.

  • There is no danger of overflow or underflow.

What are reasons that conversion from int to float could result in poor cross-platform performance, if any (other than a platform emulating floating point numbers in software)? Is conversion from int to float generally cheaper than float to int?

2
On some architectures, floating-point and integer data are stored in physically separate register files. An integer operand in a register cannot be transferred/converted directly to a floating-point operand in a register; instead the process involves storing the integer data to memory first, then reading back from memory and converting to floating-point. This may make it slow(er) even if the processor uses store-to-load forwarding. The same applies to transfer in the opposite direction (floating-point converted to integer), which can give rise to additional issues, as you already noted.njuffa
@njuffa: Doesn't the effect you mention apply to reinterpret_cast, and not conversion?Ben Voigt
@BenVoigt I did mean conversion. For example on classical x86 with the x87 FPU, you would need something like this: mov eax, dword ptr [tmp]; fld dword ptr [tmp] to convert a 32-bit signed integer in an integer register to a floating-point value in an FPU register.njuffa
@njuffa: That instruction sequence implements a reinterpret_cast, not a conversion. Did you mean fild ?Ben Voigt
@BenVoigt Yes, I meant fild dword ptr [tmp]. Sorry for the confusion, I haven't done any x86 assembly programming in a dozen years or so.njuffa

2 Answers

3
votes

Intel specifies in its "Architectures Optimization Reference Manual" that CVTSI2SD has 3-4 cycles latency (and 1 cycle throughput) on the basic desktop/server line since Core2. This can be accepted as a good example.

From the hardware point of view, such conversion requires some assistance which makes it fit in reasonable cycle amount, otherwise, it gets too expensive. A naive but rather good explanation follows. In all consideration, I assume a single CPU clock cycle is enough for an operation like full-width integer adding (but not radically longer!), and all results of previous cycle are applied on cycle boundary.

The first clock cycle with appropriate hardware assistance (priority encoder) gives Count Leading Zeros (CLZ) result among with detecting two special cases: 0 and INT_MIN (MSB set and all other bits clear). 0 and INT_MIN are better to be processed separately (load constant to destination register and finish). Otherwise, if the input integer was negative, it shall be negated; this usually requires one more cycle (because negation is combination of inversion and adding of a carry bit). So, 1-2 cycles are spent.

At the same time, it can calculate the biased exponent prediction, based on CLZ result. Notice we needn't take care of denormalized values or infinity. (Can we predict CLZ(-x) based on CLZ(x), if x < 0? If we can, this economizes us 1 cycle.)

Then, shift is applied (1 cycle again, with barrel shifter) to place the integer value so its highest 1 is at a fixed position (e.g. with standard 3 extension bits and 24-bit mantissa, this is bit number 26). This usage of barrel shifter shall combine of all low bits to the sticky bit (a separate custom barrel shifter instance can be needed; but this is waaaay cheaper than cache megabytes or OoO dispatcher). Now, up to 3 cycles.

Then, rounding is applied. Rounding is analyzing, in our case, of 4 lowest current value bits (mantissa LSB, guard, round and sticky), and, OTOH, the current rounding mode and target sign (extracted at cycle 1). Rounding to zero (RZ) results in ignoring guard/round/sticky bits. Rounding to -∞ (RMI) for positive value and to +∞ (RPI) for negative is the same as to zero. Rounding to ∞ of opposite sign results in adding 1 to the main mantissa. Finally, rounding-to-nearest-ties-to-even (RNE): x000...x011 -> discard; x101...x111 -> add 1; 0100 -> discard; 1100 -> add 1. If hardware is fast enough to add this result at the same cycle (I guess it's likely), we have up to 4 cycles now.

This adding on the previous step can lead in carry (like 1111 -> 10000), so, exponent can increase. The final cycle is to pack sign (from cycle 1), mantissa (to "significand") and biased exponent (calculated on cycle 2 from CLZ result and possibly adjusted with carry from cycle 4). So, 5 cycles now.

Is conversion from int to float generally cheaper than float to int?

We can estimate the same conversion e.g. from binary32 to int32 (signed). Let's assume that conversion of NaN, INF or too big value results in fixed value, say, INT_MIN (-2147483648). In that case:

Split and analyze the input value: S - sign; BE - biased exponent; M - mantissa (significand); also apply rounding mode. A "conversion impossible" (overflow or invalid) signal is generated if: BE >= 158 (this includes NaN and INF). A "zero" signal is generated if BE < 127 (abs(x) < 1) and {RZ, or (x > 0 and RMI), or (x < 0 and RPI)}; or, if BE < 126 (abs(x) < 0.5) with RNE; or, BE = 126, significand = 0 (without hidden bit) and RNE. Otherwise, signals for final +1 or -1 can be generated for cases: BE < 127 and: x < 0 and RMI; x > 0 and RPI; BE = 126 and RNE. All these signals can be calculated during one cycle using boolean logic circuitry, and lead to finalize result at the first cycle. In parallel and independently, calculate 157-BE using a separate adder for using at cycle 2.

If not finalized yet, we have abs(x) >= 1, so, BE >= 127, but BE <= 157 (so abs(x) < 2**31). Get 157-BE from cycle 1, this is needed shift amount. Apply the right shift by this amount, using the same barrel shifter, as in int -> float algorithm, to a value with (again) 3 additional bits and sticky bit gathering. Here, 2 cycles are spent.

Apply rounding (see above). 3 cycles spent, and carry can be produced. Here, we can again detect integer overflow and produce the respective result value. Forget additional bits, only 31 bits are valued now.

Finally, negate the resulting value, if x was negative (sign=1). Up to 4 cycles spent.

I'm not an experienced binary logic developer so could miss some chance to compact this sequence, but it looks rather close to Intel values. So, the conversions themselves are quite cheaper, provided hardware assistance is present (saying again, it results in no more than a few thousand gates, so is tiny for the contemporary chip production).

You can also take a look at Berkeley Softfloat library - it implements virtually the same approach with minor modifications. Start with ui32_to_f32.c source file. They use more additional bits for intermediate values, but this isn't principal.

0
votes

See @Netch's excellent answer re the algorithm, but it's not just the algorithm. The FPU runs asynchronously, so the int->FP operation can start and the CPU can then execute the next instruction. But when storing FP to integer, there has to be an FWAIT (Intel).