
I am struggling to convert a 32-bit floating point value to a 16-bit floating point value in C.

I understand the concept of normalizing, denormalizing, etc.

But I fail to understand the result below.

This conversion complies with the IEEE 754 standard (using round-to-nearest-even mode).

32-bit floating point
00110011 01000000 00000000 00000000 

converted 16-bit floating point
00000000 00000001
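
(For reference, this result can be reproduced mechanically, assuming a compiler and target where the _Float16 type is available, e.g. recent GCC or Clang:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t fbits = 0x33400000u;        /* the 32-bit pattern above */
    float f;
    memcpy(&f, &fbits, sizeof f);        /* reinterpret the bits as a float */

    _Float16 h = (_Float16)f;            /* converts using the current rounding
                                            mode, round-to-nearest-even by default */
    uint16_t hbits;
    memcpy(&hbits, &h, sizeof hbits);
    printf("%04x\n", (unsigned)hbits);   /* prints 0001 */
    return 0;
}
```

)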

These are the steps I've taken.

The given 32-bit floating point's sign bit is 0, the exp field is 102, and the rest is the fraction field.

The exp field carries a -127 bias, so 102 becomes 102 - 127 = -25, and the number goes like below.

// since exp field is not zero, there will be leading 1.
1.1000000 00000000 00000000 * 2^(-25)
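
For what it's worth, this is a minimal sketch of how I read those fields out of the bit pattern (the variable names are mine):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t bits = 0x33400000u;              /* 00110011 01000000 00000000 00000000 */
    unsigned sign = bits >> 31;               /* 0 */
    unsigned exp  = (bits >> 23) & 0xFFu;     /* 102 */
    unsigned frac = bits & 0x007FFFFFu;       /* 0x400000, i.e. fraction .1000... */
    printf("sign=%u exp=%u (unbiased %d) frac=0x%06x\n",
           sign, exp, (int)exp - 127, frac);  /* unbiased exponent: -25 */
    return 0;
}
```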

When converting the above number to a half-precision floating point, we have to add the bias (15) to the exponent to encode the exp field,

so the exp field is -25 + 15 = -10.

Since the encoded exp field is smaller than 0, I concluded that the given 32-bit floating point cannot be expressed as a half-precision floating point.

So I thought the half-precision floating point bit pattern would go like below:

00000000 00000000

But why 00000000 00000001?

I have read many articles uploaded on Stack Overflow, but they are just code samples that do not actually deal with the internal behavior.

Can someone please correct my misconception?

Where does the 16-bit result come from? - Eric Postpischil
Please provide a complete citation and/or a complete quote that indicates that is the expected result, including any statements about what rounding mode is used. - Eric Postpischil
Yes, they look valid to me. - Eric Postpischil
Would that result be a subnormal/denormalized value? - 1201ProgramAlarm
With an exponent part of the 32-bit float in the 0x1p-25 range, why would 0 be the closest representable value if the minimum value of a 16-bit float is 0x1p-24? Am I missing something? - AugustinLopez

1 Answer


Getting the biased exponent of -10, you need to create a denormalized number (with 0 in the exponent field) by shifting the mantissa bits right by 11. That gives you 00000 00000 for the stored mantissa bits, with 11000... shifted out into the guard bits, which round-to-even then rounds up to 00000 00001 -- the smallest possible denormal number.
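As a numeric cross-check: 1.1 in binary is 1.5, so the input is 1.5 * 2^(-25) = 0.75 * 2^(-24). The two nearest representable fp16 values are 0 and the smallest denorm 2^(-24), and 0.75 * 2^(-24) is closer to 2^(-24) than to 0, so round-to-nearest gives 2^(-24), i.e. the bit pattern 00000000 00000001.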


An IEEE fp number has a 1-bit sign, an n-bit exponent field, and an m-bit mantissa field. In the n-bit exponent field, an all-1s value represents Inf or NaN, and an all-0s value represents a denormal or zero (which of the two depends on the mantissa bits). So only exponents in the range 1..2^n-2 are valid for normalized numbers (for fp16, n = 5, so the valid range is 1..30).

So when you calculate your "Normalized and biased" exponent, if it is ≤ 0, you need to generate a denorm (or zero) instead. The value for a normalized number is

(-1)^S * (1.0 + 2^(-m) * M) * 2^(E - bias)

(where M is the value in the mantissa field treated as an unsigned integer and m is the number of mantissa bits -- some descriptions write this as 1.M). The value for a denorm is

(-1)^S * (0.0 + 2^(-m) * M) * 2^(1 - bias)
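
For example, plugging fp16's parameters (m = 10, bias = 15) into the denorm formula with S = 0 and M = 1 gives (-1)^0 * (0.0 + 2^(-10) * 1) * 2^(1-15) = 2^(-24), the smallest positive fp16 denorm mentioned above.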

That is, the exponent is the same as for a biased exponent value of 1, but the "hidden bit" (the extra bit added to the top of the mantissa) is treated as 0 instead of 1. So to convert your normalized number with the (biased) exponent of -10 to a denorm, you need to shift the mantissa (including the hidden 1 bit that is normally not stored) right by 1 - (-10) bits (that is, 11 bits) to get the mantissa value you want for the denorm. Since this always shifts by at least one bit (for any biased exponent ≤ 0), it shifts a 0 into the hidden bit position, matching the denorm meaning of the mantissa. If the exponent is small enough, the mantissa shifts completely out, giving you a 0 mantissa (which is a zero). But in your specific case, even though the mantissa shifts entirely out of the 10 bits representable in the fp16 format, the guard bits are still 1s, so it rounds up to 1.
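
To make the whole path concrete, here is a minimal sketch of the conversion in C, covering only round-to-nearest-even. The function name float_to_half and its structure are illustrative, not taken from any particular library:

```c
#include <stdint.h>
#include <stdio.h>

static uint16_t float_to_half(uint32_t f)
{
    uint16_t sign = (uint16_t)((f >> 16) & 0x8000u);   /* sign moves to bit 15 */
    uint32_t e32  = (f >> 23) & 0xFFu;                 /* biased 32-bit exponent */
    uint32_t frac = f & 0x007FFFFFu;                   /* 23 fraction bits */
    int32_t  exp  = (int32_t)e32 - 127 + 15;           /* remove -127 bias, add +15 */

    if (e32 == 0xFFu)                                  /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (frac ? 0x0200u : 0u));
    if (e32 == 0u)                                     /* float zero or denormal: */
        return sign;                                   /* far below half's range */
    if (exp >= 31)                                     /* overflow -> Inf */
        return (uint16_t)(sign | 0x7C00u);

    if (exp <= 0) {                                    /* half denormal (or zero) */
        uint32_t mant  = frac | 0x00800000u;           /* restore the hidden 1 */
        int      shift = 14 - exp;                     /* (1 - exp) + 13 field narrowing */
        if (shift > 24)                                /* all 24 bits shift out */
            return sign;
        uint32_t m    = mant >> shift;
        uint32_t rem  = mant & ((1u << shift) - 1u);   /* the bits shifted out */
        uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (m & 1u)))
            m++;                                       /* round to nearest even */
        return (uint16_t)(sign | m);                   /* carry into exp 1 is fine */
    }

    /* normalized half: keep the top 10 fraction bits, round on the low 13 */
    uint32_t rnd = frac & 0x1FFFu;
    uint16_t h   = (uint16_t)(sign | ((uint32_t)exp << 10) | (frac >> 13));
    if (rnd > 0x1000u || (rnd == 0x1000u && (h & 1u)))
        h++;                                           /* carry may bump the exponent */
    return h;
}

int main(void)
{
    printf("%04x\n", (unsigned)float_to_half(0x33400000u));  /* prints 0001 */
    return 0;
}
```

Feeding it the question's pattern 0x33400000 takes the denormal branch with exp = -10 and shift = 24 (the 11 bits from the explanation above plus the 13-bit narrowing from 23 to 10 fraction bits), and prints 0001.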