I am having a slight problem interpreting the OpenCL spec with regard to the -cl-fast-relaxed-math compiler flag.
The definition of that compiler flag is:
Sets the optimization options -cl-finite-math-only and -cl-unsafe-math-optimizations. This allows optimizations for floating-point arithmetic that may violate the IEEE 754 standard and the OpenCL numerical compliance requirements defined in section 7.4 for single-precision floating-point, section 9.3.9 for double-precision floating-point, and edge case behavior in section 7.5. This option causes the preprocessor macro __FAST_RELAXED_MATH__ to be defined in the OpenCL program.
In my program I basically don't care about exact IEEE compliance. However, on Intel's integrated GPU this flag causes all of the trigonometric functions to be replaced with their native implementations. In the case of half_sin (10 bits of precision according to the spec) it doesn't really matter: in the worst case the output still has maybe 8 bits, so 2 bits less precision than the spec requires is not too bad.
However, the Intel GPU implementation (their CPU implementation works fine) also replaces calls to sin, causing a loss of up to 11 bits of precision. Well over half of the required accuracy gets thrown away (roughly 19 bits are required for sin). I no longer consider that expected behaviour, and it makes software development a bit hard when the only options are either to demand strict IEEE compliance or to abandon the majority of the usable precision of the normal built-in functions.
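To make the comparison concrete, here is roughly the kind of test kernel I mean (a minimal sketch; the kernel name and buffer layout are my own, and the host side that compares each output against a double-precision reference is omitted):

```c
// Illustrative test kernel: evaluate the three sin variants side by side
// so the host can compare each result against a double-precision reference.
__kernel void sin_variants(__global const float *x,
                           __global float *out_sin,
                           __global float *out_half,
                           __global float *out_native)
{
    size_t i = get_global_id(0);
    out_sin[i]    = sin(x[i]);        // spec requires <= 4 ulp error
    out_half[i]   = half_sin(x[i]);   // spec requires ~10 bits of accuracy
    out_native[i] = native_sin(x[i]); // accuracy is implementation-defined
}
```

Built without -cl-fast-relaxed-math, out_sin keeps its specified accuracy; built with the flag on the Intel GPU, out_sin degrades to the native_sin level, which is how the substitution shows up.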
I am aware that, strictly speaking, they are following the spec: the compiler flag allows them to break the numerical compliance requirements of the section that defines the required accuracy of the built-in functions. However, even the mildest flag, -cl-finite-math-only, which merely lets the compiler assume there are no NaNs or infinities, formally permits total disregard of all built-in function accuracy.
The closest analogue in the C world is GCC's -ffast-math flag. It permits the compiler to do the same kind of optimizations, yet in that case the normal math library functions remain usable even though the compiler may assume floating-point math is associative, etc.
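For comparison, a minimal C example (my own, just to illustrate the GCC behaviour I mean): compiled with -ffast-math, the surrounding arithmetic may be reassociated, but sin() still resolves to the regular libm implementation rather than a low-precision substitute.

```c
#include <math.h>
#include <stdio.h>

/* Compile with: gcc -O2 -ffast-math fast_math.c -lm
 * The compiler may reassociate the sum below and assume no NaNs or
 * infinities, but the sin() calls still go to libm and keep its
 * usual accuracy. */
int main(void)
{
    volatile double x = 1.0; /* volatile prevents compile-time folding */
    double y = sin(x) + sin(2.0 * x) + sin(3.0 * x);
    printf("%.17g\n", y);
    return 0;
}
```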
I would like to know how these sorts of precision requirements are handled in general. Is Intel actually following the spirit of the spec? Does anyone know how to let the Intel GPU compiler optimize kernels to its heart's content (for example, assuming floating-point math is associative) without unreasonably ruining the precision of the built-in functions?
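One direction I have been wondering about is passing the individual flags instead of the umbrella option; a minimal host-side sketch (error handling omitted, and whether Intel's GPU compiler treats this any differently from -cl-fast-relaxed-math is exactly what I don't know):

```c
#include <CL/cl.h>

/* Sketch: build with -cl-unsafe-math-optimizations alone (it implies
 * -cl-no-signed-zeros and -cl-mad-enable) but without
 * -cl-finite-math-only, in the hope of getting reassociation without
 * the native_* substitution. */
cl_int build_relaxed(cl_program program, cl_device_id device)
{
    const char *options = "-cl-unsafe-math-optimizations";
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}
```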