0 votes

I want to get familiar and comfortable with floating-point numbers, so I'm doing a project that will hopefully help: creating dynamically allocated, arbitrarily sized floating-point numbers in C++. I've looked through the IEEE-754 specifications for the standard floating-point definitions, but I could not find a common correlation between them (I used references from Wikipedia on 32-, 64-, and 128-bit floating-point numbers). So my question is: is there a common pattern between floating-point numbers that can be applied to any arbitrarily sized floating-point number?

If not, from a programming perspective, would it be easier to define my own floating-point representation that does have a pattern?

EDIT: By pattern I mean the number of bits in the mantissa and exponent.

By "pattern" do you mean the relationship between the number of mantissa and exponent bits?Oliver Charlesworth
@OliverCharlesworth Yes, I should have clarified that. The number of bits in the mantissa and exponent.Max
Ok. I guess that each was tuned individually to satisfy a trade-off between a number of potential use cases, rather than to fit a mathematical pattern.Oliver Charlesworth

3 Answers

2 votes

There is no mandated mathematical rule for the numbers of bits in the significand¹ or the exponent. IEEE 754-2008 does show a formula that describes its listed interchange formats for certain sizes, but this is in a non-normative note:

  • For a storage width of k bits, the number of bits in the significand (the mathematical significand with the leading bit, not the field that encodes it without the leading bit), p, is k − round(4 × log2(k)) + 13.
  • The number of bits in the exponent field, w, is k − p.

The formula does not hold for 16 or 32 bits; it is only said to hold for 64 bits and for widths that are multiples of 32 and at least 128 (so neither 32 nor 96). I suppose you can consider it a suggestion for larger sizes, but it is not binding.
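To make the note concrete, here is a minimal C++ sketch (mine, not text from the standard) that evaluates the formula and checks it against the published binary64 and binary128 parameters:

#include <cmath>
#include <cstdio>

// p = k - round(4 * log2(k)) + 13: significand width including the leading bit.
int significand_bits(int k) {
    return k - (int) std::lround(4.0 * std::log2((double) k)) + 13;
}

int main() {
    // The formula is said to hold for 64 and for multiples of 32 from 128 up.
    for (int k : {64, 128, 160, 256}) {
        int p = significand_bits(k);
        std::printf("k=%3d  p=%3d  w=%2d\n", k, p, k - p);  // exponent width w = k - p
    }
    // Expected: k=64 -> p=53, w=11; k=128 -> p=113, w=15; k=256 -> p=237, w=19.
}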

As far as I know, the parameters specified in table 3.5 of clause 3.6 of IEEE 754-2008 arise from striking balances and from historic usage. You can define formats with other parameters as described in clause 3.7, which gives recommendations for defining extended precisions in terms of the precision (digits in the significand) and the maximum exponent, or of the precision alone. Or you can disregard IEEE 754 and define your own formats. The standards are not mandatory, and what your design should be is a function of what your goals are.
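Clause 3.7's parameterization fits in a couple of lines; the descriptor below is a hypothetical sketch of my own, not a type from the standard or any library:

// An extended precision per clause 3.7 is characterized by its precision p and
// maximum exponent emax; IEEE 754-2008 then fixes the minimum exponent at 1 - emax.
struct ExtendedFormat {
    int precision;                          // p: significand digits (bits, if binary)
    int emax;                               // largest exponent of a finite number
    int emin() const { return 1 - emax; }
};

// Example: binary64 expressed through these two parameters.
constexpr ExtendedFormat binary64_like{53, 1023};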

Note

¹ “Significand” is the preferred term for the fraction part of a floating-point number. “Mantissa” is a term for the fraction part of a logarithm. Significands are linear (if the number increases by a factor of 1.2, the significand increases by a factor of 1.2, unless an exponent threshold is crossed); mantissas are logarithmic.

3 votes

The 2008 version of IEEE 754 specifies that interchange formats wider than 128 bits shall follow a common scheme.

For binary formats, the full width k shall be a multiple of 32 bits, and the number of exponent field bits shall be round(4 × log2(k)) − 13. One can verify that this formula also gives the proper values for the 64- and 128-bit formats, but not for the 16- or 32-bit ones (their exponents are wider).

For decimal formats, the full width k shall be a multiple of 32 bits, and the number of combination field bits shall be k / 16 + 9. This formula also reproduces the actual 32-, 64-, and 128-bit formats.
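A quick sketch checking that rule (my own verification code, not normative text):

#include <cstdio>

int main() {
    for (int k : {32, 64, 128}) {
        int combination = k / 16 + 9;        // combination field bits
        int trailing = k - 1 - combination;  // the rest, minus the sign bit
        std::printf("decimal%d: combination=%d, trailing significand=%d\n",
                    k, combination, trailing);
    }
    // Prints 11/20, 13/50, 17/110, matching decimal32/64/128.
}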

All other properties of the formats, and of the operations on them, remain unchanged: significand interpretation, exponent bias and interpretation, rounding, and so on. If you

could not find a common correlation between them

you were likely thrown by the lack of visible logic in the choice of field widths. Yes, they are empirical: adapted to accumulated experience in number handling and to the need to fit more data into limited space, rather than to a common mathematical rule.

On the other hand, you are not limited to these standard restrictions. Moreover, since IEEE is mainly targeted at hardware design and the IEEE 754 standard is shaped for ease of hardwired implementation, you needn't follow its restrictions and can use any software implementation (such as GMP or MPFR). One advantage of a software implementation is that no time is spent unpacking numbers for calculation and packing them back.
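For instance, with MPFR the significand precision is chosen per variable, with no requirement that it match a hardware width. A minimal sketch (link with -lmpfr -lgmp):

#include <mpfr.h>

int main() {
    mpfr_t x;
    mpfr_init2(x, 200);             // 200-bit significand, an arbitrary choice
    mpfr_set_ui(x, 2, MPFR_RNDN);
    mpfr_sqrt(x, x, MPFR_RNDN);     // sqrt(2), correctly rounded to 200 bits
    mpfr_printf("%.60Rf\n", x);     // about 60 decimal digits
    mpfr_clear(x);
    return 0;
}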

2 votes

IEEE-754 binary types specify the exponent bit width as below.

FP bit size  Expo bit size
16           5
32           8
64           11
128          15
256          19

The remainder of the type holds 1 sign bit and the significand.

Per @Netch's good answer above, the exponent bit width is round(4 * log2(k)) − 13 for bit widths that are multiples of 32, from 64 up.

An empirical answer to "Is there a common pattern between floating point numbers that can be applied to any arbitrarily sized floating point number?" could use the function below to stay consistent with the existing IEEE-754 standard and extend it to other bit sizes fp_size >= 8 (or >= 6 if you want to push it).

#include <math.h>  /* lrint, log2 */

int expo_width(int fp_size) {
  /* widths above 32 fit round(4*log2(k)) - 13; smaller ones fit round(3*log2(k)) - 7 */
  return (int) lrint(fp_size > 32 ? 4*log2(fp_size) - 13 : 3*log2(fp_size) - 7);
}
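A quick harness (my own) to check it against the table above:

#include <stdio.h>

int expo_width(int fp_size);  /* defined above */

int main(void) {
    for (int k = 8; k <= 256; k *= 2)
        printf("%3d-bit float -> %2d exponent bits\n", k, expo_width(k));
    /* Prints 2, 5, 8, 11, 15, 19; the last five match the IEEE-754 table. */
    return 0;
}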