2
votes

So I know that the smallest positive integer not representable by a single precision floating point is 2^(23+1) + 1 = 16,777,217.

How did we figure out that we use 2^(23+1) + 1. I understand that there is an implied 1, along with 23 being the number of bits represented in the mantissa but why does this work?

3
Any power-of-two integer (up to 2^127) can be represented exactly with floating-point. You will have to make the definition a bit more precise.nneonneo
I want to find the smallest integer that can't be represented from single floating point precision (Not necessarily powers of two) . From what I've found, it is 16,777,217, however I don't fully understand how we find that number.user2540748
@Wilson: You answered your question in your second paragraph. I have nothing to add to your explanation. In particular, I don't know what you're looking for when you ask "why does this work?."tmyklebu
@nneonneo Since there is only 2^32 different ways to set 32 bits (a float), you can at most represent 2^32 different values exactly. The float type behaves as a decimal number where you can only use a 7 of the digits 1-9, but you can fill up with 120+ zeroes before or after if needed.Daniel B.
@DanielB. If you read my comment carefully you'll note I refer specifically to power-of-two integers - 2^0, 2^1, 2^2, 2^3, ..., 2^127. All of those can be represented precisely in single-precision float, although their neighboring integers may not be represented precisely.nneonneo

3 Answers

2
votes

I think the trick here is to understand the basis of the floating point representation: Each number is represented as 1.fraction * 2^exponent. The key here is to know there are limits for both exponent (8 bits) and fraction (23 bits), but these limits do not necessarily match. For example, we can create 2^24 with the 8 bits exponent, while we cannot make 2^-24 with fraction (because it only has 23 bits). Thus, if you want to make number 16777216 = 2^24, you just set the fraction to 0 and set the exponent to represent 24. However, if you want to represent 16777217 = 2^24 + 1, the only thing you can do is to add a small fraction so when it multiplies with 2^24 it produce 1, and that small fraction should be 2^-24, which unfortunately cannot be produced by only 23 digits.

1
votes

I think i got your question. Have a look on this, especially on how the structure/design of these variables are done. http://en.m.wikipedia.org/wiki/Single_precision

Float usually stands for floating point variable. This means you have (usually 3 bytes) where you store your number. Then you additionally have a (one byte) exponent, which says where to set the point within this number.

Now you can easily calculate the max and min numbers one can store in this value.

But there is a tricky part. As this is no fix point integer, it may have limited precision that can cause strange problems. As the number gets bigger, the absolute distance between the number is getting bigger. At some point, you will reach a number, where you can add 1, and it will remain the same number, cause one is outside the precision range you have available. As you will see on the wiki page above: 1 bit is used to flag negative numbers, 23 bits are used for your precision and 8 are used as an exponent. Now imagine, as an example, the exponent would be 40, now you will have a 23 bits number where the point is placed on position 40. All the rest is filled with 0. Adding 1 would not change the number, cause it is outside the significant scope and would not be stored.

Maybe you were asking why there is another +1 in the exponent. This is nicely explained here: Which is the first integer that an IEEE 754 float is incapable of representing exactly? It's cause an mantissa of the form abcdefg actually stands for 1.abcdefg.

1
votes

How did we figure out that we use 2^(23+1) + 1

IEEE floating point represents normalised numbers in the following form.

(-1)s2(e – e0)(1+(m/2M))

Where:

  • s is the sign bit, with a value of 0 or 1.
  • e is the exponent field, with a value between 1 and 2E-2 where E is the number of exponent bits (values of 0 and 2E-1 are used for subnormals and infinites/NaNs).
  • e0 is the exponent bias. It essentially sets the overall range of the floating point number.
  • M is the number of mantissa bits.
  • m is the mantissa with a value between 0 and 2M-1

zero can't be represented by this format, but it can be resented as a subnormal, so that is ok. All non-zero integers that can be represented are represented in the normalized format.

To convert a positive integer i to floating point we use.

e = floor(log2(i)) + e0

m = ((i/2(e – e0))-1) 2M

If i < 2M+1 then (e − e0) ≤ M so m is an integer. Therefore i is representable.

If i = 2M+1 then (e − e0) = M+1 and m = 0. Therefore i is representable.

If i = 2M+1 + 1 then (e − e0) = M+1 and m = ½. Therefore i is not representable.