1 vote

It is possible to store a pair of 32-bit single-precision floating-point numbers in the same space taken by one 64-bit double-precision number. For example, the XMM registers of the SSE2 instruction set can store four single-precision numbers or two double-precision numbers.
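
As a minimal sketch with SSE2 intrinsics (assuming an x86 compiler that provides <emmintrin.h>), the same 128-bit register size holds either layout:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* One 128-bit XMM-sized value viewed as four binary32 numbers... */
        __m128  four_singles = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        /* ...or as two binary64 numbers; both occupy 16 bytes. */
        __m128d two_doubles  = _mm_set_pd(2.0, 1.0);

        float  s[4];
        double d[2];
        _mm_storeu_ps(s, four_singles);
        _mm_storeu_pd(d, two_doubles);
        printf("%zu bytes hold %g %g %g %g, or %g %g\n",
               sizeof four_singles, s[0], s[1], s[2], s[3], d[0], d[1]);
        return 0;
    }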

By the IEEE 754 standard, the difference between single and double precision is not only the precision per se but also the available range: 8 and 11 exponent bits, respectively.
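
Concretely, binary32 is 1 sign + 8 exponent + 23 mantissa bits (bias 127), and binary64 is 1 + 11 + 52 (bias 1023). A small sketch extracting the exponent fields:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float  f = 1.0f;
        double d = 1.0;
        uint32_t fb; uint64_t db;
        memcpy(&fb, &f, sizeof fb);   /* bit-exact view without aliasing UB */
        memcpy(&db, &d, sizeof db);

        /* binary32: 1 sign | 8 exponent | 23 mantissa bits */
        unsigned fexp = (fb >> 23) & 0xFF;
        /* binary64: 1 sign | 11 exponent | 52 mantissa bits */
        unsigned dexp = (unsigned)((db >> 52) & 0x7FF);
        printf("exponent fields: %u of 255 max, %u of 2047 max\n", fexp, dexp);
        return 0;
    }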

Intuitively, it seems to me that if you were designing an FPU to process either 2N single-precision numbers or N double-precision numbers in parallel, the circuit design should be simpler if you deviate from the IEEE standard and make both use the same number of exponent bits. For example, the bfloat16 half-precision format trades away some mantissa bits to keep the same number of exponent bits as single precision; part of the justification given for this is that it's easier to convert between bfloat16 and single precision.
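
To illustrate why that conversion is cheap: since bfloat16 keeps the binary32 sign and exponent layout, narrowing is just dropping the low 16 bits and widening is a 16-bit shift. A sketch with hypothetical helper names (real hardware typically rounds to nearest-even rather than truncating):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helpers; truncation only, for simplicity. */
    static uint16_t float_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);  /* sign + 8 exp + top 7 mantissa bits */
    }

    static float bf16_to_float(uint16_t b) {
        uint32_t bits = (uint32_t)b << 16;  /* low mantissa bits become zero */
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }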

Do any actual vector instruction sets use the same number of exponent bits for single and double precision? If so, do they stick closer to the 8 bits typical for single precision, or 11 bits typical for double precision?

For scalar processing, the DEC VAX initially used this approach with its F (single precision) and D (double precision) formats; both used an 8-bit exponent field. However, the small exponent range caused numerical issues for double-precision computation in some contexts, so a G format (basically IEEE-754 double precision) was added later. – njuffa
@njuffa: Interesting! Worth posting as an answer if you want, even though the question for some reason limited itself to SIMD. It does make more sense for scalar FPUs back when transistor budgets were smaller; if you never need that wider exponent, you don't have to build it at all. – Peter Cordes
@Peter Cordes: The approach used by the VAX was not uncommon in older computers. E.g., the IBM System/360 used a radix-16 floating-point format with a 7-bit binary exponent for both single and double precision. This question is focused on newer SIMD-based architectures; that is why I did not post that little tidbit of information as an answer. – njuffa
@njuffa: It didn't occur to me that this would also have cropped up with scalar FPUs, which is why I only talked about SIMD, but you're right, that is an interesting example! – rwallace

2 Answers

2 votes

AFAIK, nobody does this. Sign-extending and zero-extending are pretty trivial in hardware compared to the transistor cost of building an FPU execution unit overall.

Routing the exponent vs. mantissa bits where they need to go is not a big deal compared to building a multiplier you can use as one 53-bit significand multiplier or two separate 24-bit ones (52 and 23 stored mantissa bits, plus the implicit leading 1). That way the same transistors can be used for the mantissas of packed-single and packed-double multiplies / FMAs; that's a large fraction of the die area of an FMA/multiplier unit.
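
As a toy software model of that sharing (not how real hardware is wired, just to show that the same shift-and-add structure can be gated either way):

    #include <stdint.h>

    /* One wide 53x53 significand product (GCC/Clang __int128 assumed). */
    static unsigned __int128 mul_wide(uint64_t a, uint64_t b) {
        unsigned __int128 acc = 0;
        for (int i = 0; i < 53; i++)
            if ((b >> i) & 1)
                acc += (unsigned __int128)a << i;  /* partial-product row i */
        return acc;
    }

    /* Two independent 24x24 products from the same loop structure,
       with the rows gated per lane. */
    static void mul_two_narrow(uint32_t a0, uint32_t b0,
                               uint32_t a1, uint32_t b1,
                               uint64_t *p0, uint64_t *p1) {
        uint64_t acc0 = 0, acc1 = 0;
        for (int i = 0; i < 24; i++) {
            if ((b0 >> i) & 1) acc0 += (uint64_t)a0 << i;
            if ((b1 >> i) & 1) acc1 += (uint64_t)a1 << i;
        }
        *p0 = acc0; *p1 = acc1;
    }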


AFAIK, all CPUs modern enough to have SIMD at all use IEEE-754 formats because that's what people want, and there's no compelling reason to do otherwise. Certainly the vast majority of them use the standard formats.

ARM NEON, for example, initially didn't support full IEEE 754, but what it left out was gradual underflow (subnormals were flushed to zero). It still used the IEEE binary32 and binary64 (standard float and double) data formats.
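
For what gradual underflow buys you, a small sketch; with subnormals flushed to zero, as early NEON did, the difference below collapses to 0:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        /* Two distinct tiny normal numbers whose difference is subnormal. */
        float x = 1.5f * FLT_MIN, y = FLT_MIN;
        float d = x - y;   /* 0.5 * FLT_MIN: representable only as a subnormal */
        printf("%g\n", d); /* nonzero with gradual underflow; 0 when flushed */
        return 0;
    }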

1 vote

Do any actual vector instruction sets use the same number of exponent bits for single and double precision?

Not that I'm aware of. However, if you don't strictly need vector instructions, x87 hardware does exactly that. Its internal format even has more bits than double precision: 80 bits in total, with 15 exponent bits and 64 mantissa bits, shared by single- and double-precision operands alike.

The FPU has a control word whose precision-control field selects one of three precisions: 24-bit, 53-bit, or 64-bit significands, corresponding to the 32-, 64-, and 80-bit formats. When set to 24-bit, every arithmetic instruction rounds the significand to single precision. (The exponent range is not reduced while values stay in registers; out-of-range values only become ±INF or zero when stored to a 32-bit memory location.)
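
A sketch of poking that control word directly, assuming GCC/Clang inline assembly on x86 and that long double math actually goes through x87:

    #include <stdint.h>
    #include <stdio.h>

    /* Precision-control field = bits 8-9 of the x87 control word:
       00 = 24-bit, 10 = 53-bit, 11 = 64-bit significand. */
    static void set_x87_precision(uint16_t pc_bits) {
        uint16_t cw;
        __asm__ __volatile__("fnstcw %0" : "=m"(cw));  /* read control word */
        cw = (uint16_t)((cw & ~0x0300u) | (pc_bits << 8));
        __asm__ __volatile__("fldcw %0" : : "m"(cw));  /* write it back */
    }

    int main(void) {
        set_x87_precision(0x0);       /* 24-bit significand, as for float */
        volatile long double x = 1.0L, y = 3.0L;
        printf("%.20Lf\n", x / y);    /* only ~7 significant digits survive */
        set_x87_precision(0x3);       /* restore 64-bit extended precision */
        return 0;
    }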

Modern compilers no longer emit these instructions; instead, they use the lowest lane of SSE vector registers for scalar math.

the circuit design should be simpler if you deviate from the IEEE standard and make both use the same number of exponent bits.

Yes indeed. That's precisely how Intel was able to launch their 8087 FPU in 1980: the whole chip had only about 45,000 transistors.

However, modern CPUs have budgets of billions of transistors. Simplicity of the design is no longer the priority; performance and power consumption are.

Speaking of performance, the 8087 spends up to about 200 cycles to divide two single-precision numbers. My current CPU (AMD Zen 2) spends up to 10 cycles to divide 32-bit floats (8 of them at once), and up to 13 cycles to divide 64-bit floats (4 of them at once). That's a huge improvement over 200 cycles, but the price is complexity and transistor count.
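
For reference, each of those is a single instruction (vdivps / vdivpd); with AVX intrinsics (compile with -mavx):

    #include <immintrin.h>

    /* Eight binary32 divisions in one vdivps... */
    __m256 div8_floats(__m256 a, __m256 b) { return _mm256_div_ps(a, b); }

    /* ...or four binary64 divisions in one vdivpd. */
    __m256d div4_doubles(__m256d a, __m256d b) { return _mm256_div_pd(a, b); }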