
Here is an example: 50.33123 can be stored in FP32 (1.8.23) format as 0x4249532E. If we convert this to binary:

0100 0010 0100 1001 0101 0011 0010 1110

The first bit is the sign bit; 0 means a positive number.

The next 8 bits are the exponent -> 1000 0100₂ -> 0x84 -> 132₁₀. Unbiased exponent: 132 - 127 = 5.

Mantissa (with the implied leading 1): 1.1001 0010 1010 0110 0101 110₂ (23 stored bits)

Left shift by the exponent => 110010.010101001100101110₂ ≈ 50.33123₁₀
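For a quick cross-check, the same decoding can be done in C (a sketch; it assumes float is IEEE-754 binary32, which is near-universal):

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void) {
    unsigned int bits = 0x4249532Eu;

    unsigned int sign     = bits >> 31;           /* 1 bit              */
    unsigned int exponent = (bits >> 23) & 0xFF;  /* 8 bits: 0x84 = 132 */
    unsigned int mantissa = bits & 0x7FFFFF;      /* 23 bits            */

    /* value = (-1)^sign * 1.mantissa * 2^(exponent - 127) */
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + mantissa / 8388608.0)   /* 8388608 = 2^23 */
                 * pow(2.0, (int) exponent - 127);
    printf("sign=%u exponent=%u (unbiased %d) value=%.8f\n",
           sign, exponent, (int) exponent - 127, value);

    /* Cross-check by reinterpreting the bit pattern as a float. */
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("reinterpreted: %.8f\n", f);
    return 0;
}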

If we store the same value in FP16 (half precision format), FP16 => 1.5.10:

Without rounding (truncation):
1.1001 0010 10₂ left shift by 5 => 110010.01010₂ => 50.3125₁₀;
the error is 50.33123 - 50.3125 = 0.01873.

With rounding (to nearest):
1.1001 0010 11₂ left shift by 5 => 110010.01011₂ => 50.34375₁₀;
the error is 50.33123 - 50.34375 = -0.01252.
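This rounding error can be observed directly in code (a sketch; it assumes a compiler such as recent GCC or Clang that provides the _Float16 type):

#include <stdio.h>

int main(void) {
    float a = 50.33123f;
    _Float16 h = (_Float16) a;   /* rounds to nearest fp16: 50.34375 */
    printf("fp32:  %.7f\nfp16:  %.7f\nerror: %.7f\n",
           a, (float) h, a - (float) h);
    return 0;
}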

My question is: here the error is significant.
Is there any way to reduce the error further with FP16 implementations?

I think you have reached the end of precision; there is no way to minimise the error any further. That said, I am not sure I understood your question correctly. Isn't it the same as asking "how can I reduce the error of representing 257 in an unsigned byte? 257 - 255 == 2"? 2 is the smallest error you can get for 257 represented in an unsigned 8-bit integer. – Yunnosch

Can you show the kind of calculation you do on those 16-bit floats? Maybe with a bit of math it is possible to work on a foundation of a "middle value" (working point), stored in one float, and then always calculate the delta. Finally, add the middle value and the delta and use the result for whatever follows. – Yunnosch

Example: float a = 50.33123, b = 50.33123; fp_16 a_fp16, b_fp16; a_fp16 = (fp_16) a; b_fp16 = (fp_16) b; for (int i = 0; i < 1000; i++) { out_fp16 += a_fp16 * b_fp16; } I am seeing a huge precision error in this case. – sathyarokz

Typo corrected: out_fp32 += a_fp16 * b_fp16; The a and b float values vary in my original case; just for simplicity I used fixed values here. – sathyarokz

I understand that you have to store a and b in 16-bit floats; the calculation result, however, is finally stored (and accumulated) in a 32-bit float. Did you try to first convert to 32-bit and then calculate purely in 32-bit? Afterwards, the converted 32-bit values can be discarded, i.e. a and b stay stored only as 16-bit. I understand that this might not be the solution, but the experiment might be enlightening. Theoretically, you might accumulate (over the loop) a rather small error, letting it grow big. I actually doubt that, but for clarity and for exclusion analysis, the experiment seems worthwhile. – Yunnosch
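A minimal sketch of the experiment suggested in these comments, assuming _Float16 support (exact totals will vary by platform):

#include <stdio.h>

int main(void) {
    float a = 50.33123f, b = 50.33123f;
    _Float16 a16 = (_Float16) a, b16 = (_Float16) b;  /* stored as fp16 */

    float acc_fp16_mul = 0.0f;   /* product rounded to fp16 each time */
    float acc_fp32_mul = 0.0f;   /* operands widened to fp32 first    */
    double reference   = 0.0;    /* full-precision reference          */

    for (int i = 0; i < 1000; i++) {
        acc_fp16_mul += (float) (_Float16) (a16 * b16);
        acc_fp32_mul += (float) a16 * (float) b16;
        reference    += (double) a * (double) b;
    }
    printf("fp16 multiply: %f\nfp32 multiply: %f\nreference:     %f\n",
           acc_fp16_mul, acc_fp32_mul, reference);
    return 0;
}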

1 Answer


How do we minimize precision error with FP16 half-precision floating point numbers?

FP16 => 1.5.10 explicitly stores 10 bits of precision in fp_16, a binary floating point format. With the implied bit, that provides values whose unit in the last place (ULP) is 2⁻¹⁰ of the most significant bit. 50.33123 as a float has an exact value of 50.33123016357421875, or 0x1.92A65Cp+5. With rounding to minimize precision error, the closest value as fp_16 is 50.34375, or 0x1.92Cp+5.

OP has done this rounding for minimal error.
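For illustration, the two neighbouring fp16 candidates and the ULP at this magnitude can be checked with a short sketch (again assuming a compiler with _Float16):

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 50.33123f;
    /* At magnitude 2^5, one fp16 ULP is 2^(5-10) = 0.03125. */
    float ulp = ldexpf(1.0f, 5 - 10);
    _Float16 lo = (_Float16) 50.3125f;    /* 0x1.928p+5, truncated  */
    _Float16 hi = (_Float16) 50.34375f;   /* 0x1.92Cp+5, rounded up */
    printf("ulp=%f lo=%f hi=%f error to nearest=%f\n",
           ulp, (float) lo, (float) hi, x - (float) hi);
    return 0;
}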


... the error in this case is 50.33123 - 50.34375 = -0.01252.
My question is: here the error is significant. Is there any way to reduce the error further with FP16 implementations?

This difference of about 0.025% is not unexpected. Without changing the 1.5.10 format, or storing additional values as below, this precision loss is unavoidable.

float a = 50.33123f;
fp_16 a_fp16_upper = (fp_16) a;                          /* nearest fp16: 50.34375     */
fp_16 a_fp16_lower = (fp_16) (a - (float) a_fp16_upper); /* rounding error: ~ -0.01252 */
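The two halves can then be recombined in wider arithmetic when needed. A sketch of the effect, with the hypothetical fp_16 type replaced by _Float16:

#include <stdio.h>

int main(void) {
    float a = 50.33123f;
    _Float16 upper = (_Float16) a;                    /* 50.34375   */
    _Float16 lower = (_Float16) (a - (float) upper);  /* ~ -0.01252 */

    /* Reconstructing from both halves recovers most of the lost precision. */
    float reconstructed = (float) upper + (float) lower;
    printf("one fp16:  error %g\n", a - (float) upper);
    printf("two fp16s: error %g\n", a - reconstructed);
    return 0;
}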