I have one example: 50.33123 can be stored in FP32 (1.8.23) format as 0x4249532E. If we convert this to binary:
0100 0010 0100 1001 0101 0011 0010 1110
The first bit is the sign bit, which is 0, meaning the number is positive.
The next 8 bits are the exponent: 1000 0100₂ = 84₁₆ = 132₁₀. Unbiased exponent: 132 - 127 = 5.
Mantissa: 1.1001 0010 1010 0110 0101 110 (23 bits)
Left shift by the exponent => 110010.010101001100101110₂ => 50.33123₁₀
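The decoding steps above can be checked with a short Python sketch. The helper name `decode_fp32` is made up for this illustration, and the reconstruction formula only covers normal numbers (no zeros, subnormals, infinities or NaNs).

```python
import struct

def decode_fp32(bits):
    """Split a 32-bit pattern into sign, biased exponent and 23-bit fraction."""
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits, implicit leading 1 not stored
    return sign, exponent, fraction

bits = 0x4249532E
sign, exponent, fraction = decode_fp32(bits)

# Rebuild the value by hand: (-1)^sign * 1.fraction * 2^(exponent - 127)
value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

# Cross-check by letting struct reinterpret the same 32 bits as a float
reference = struct.unpack(">f", struct.pack(">I", bits))[0]

print("sign     :", sign)
print("exponent :", exponent, "-> unbiased", exponent - 127)
print("fraction :", format(fraction, "023b"))
print("value    :", value, "reference:", reference)
```

The printed fraction bits should match the 23 mantissa bits listed above, and the hand-built value should equal what `struct` returns for the same pattern.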
If we store the same value in FP16 (half-precision format, 1.5.10):
Without rounding:
1.1001 0010 10₂
left shift by 5 => 110010.01010₂ => 50.3125₁₀,
the error is 50.33123 - 50.3125 = 0.01873.
With rounding:
1.1001 0010 11₂ => left shift by 5 => 110010.01011₂ => 50.34375₁₀,
the error is 50.33123 - 50.34375 = -0.01252.
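Both FP16 results can be reproduced with a small Python sketch. It assumes round-to-nearest, ties-to-even as the rounding mode and does not handle the case where rounding carries out of the 10-bit fraction. The names `to_value`, `frac_trunc` and `frac_round` are illustrative only. Python's `struct` format `'e'` (IEEE 754 binary16) serves as a cross-check.

```python
import struct

x = 50.33123
frac23 = 0x4249532E & 0x7FFFFF   # 23-bit FP32 fraction from the example
unbiased_exp = 5

# Keep the top 10 fraction bits, look at the 13 bits that get dropped
frac_trunc = frac23 >> 13
dropped = frac23 & 0x1FFF
half = 1 << 12                   # value of the first dropped bit

# Round to nearest, ties to even (carry into the exponent is not handled here)
round_up = dropped > half or (dropped == half and (frac_trunc & 1) == 1)
frac_round = frac_trunc + (1 if round_up else 0)

def to_value(frac10, exp):
    """Value of a positive normal number with a 10-bit fraction."""
    return (1 + frac10 / 2**10) * 2.0 ** exp

truncated = to_value(frac_trunc, unbiased_exp)
rounded = to_value(frac_round, unbiased_exp)

# Cross-check: let struct convert to binary16 and back
native = struct.unpack("<e", struct.pack("<e", x))[0]

# Spacing between adjacent FP16 values at this exponent
ulp = 2.0 ** (unbiased_exp - 10)

print("truncated:", truncated, "error:", x - truncated)
print("rounded  :", rounded, "error:", x - rounded)
print("native   :", native)
print("ulp      :", ulp, "half ulp:", ulp / 2)
```

At exponent 5 the spacing between adjacent FP16 values is 2^-5 = 0.03125, so the rounded error of about -0.01252 already sits within half of that spacing (0.015625), while the truncated error of about 0.01873 does not.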
My question: the error here is significant.
Is there any way to reduce the error further with FP16 implementations?
257-255==2
" ? 2 is the smallest error you can get for 257 represented in an unsigned 8bit. – Yunnosch