I'm optimizing code for ARM processors using NEON. However, I have a problem: my algorithm contains the following floating-point computation:
round(x*b - y*a)
where the results can be either positive or negative.
Currently I use two VMUL instructions and one VSUB to compute four values in parallel (Q registers holding 32-bit floats).
Is there a way to handle the rounding in this case? If all the results had the same sign, I know I could simply add or subtract 0.5.
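
To make the goal concrete, here is a scalar C sketch of the behaviour I am after, assuming "round" means round half away from zero. The names `round_signed` and `round_diff4` are made up for illustration, and treating `a` and `b` as per-element arrays is also an assumption. The idea is to copy the sign bit of each result onto 0.5, add it, and then truncate toward zero; this might map onto NEON bitwise operations followed by a float-to-int conversion, but I have not verified that.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round half away from zero by adding a 0.5
 * that carries the same sign as v, then truncating toward zero. */
static int32_t round_signed(float v)
{
    float half = 0.5f;
    uint32_t v_bits, half_bits;

    memcpy(&v_bits, &v, sizeof v_bits);           /* reinterpret float bits */
    memcpy(&half_bits, &half, sizeof half_bits);
    half_bits |= v_bits & 0x80000000u;            /* copy sign of v onto 0.5 */
    memcpy(&half, &half_bits, sizeof half);

    return (int32_t)(v + half);                   /* C cast truncates toward zero */
}

/* Hypothetical 4-lane version of round(x*b - y*a), one lane per loop pass. */
static void round_diff4(const float x[4], const float y[4],
                        const float a[4], const float b[4],
                        int32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        out[i] = round_signed(x[i] * b[i] - y[i] * a[i]);
    }
}
```

This only covers values that fit in a 32-bit integer, and inputs just below 0.5 in magnitude can still round up because the addition itself rounds.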