0
votes

I'm optimizing code for ARM processors using NEON. However, I have a problem: my algorithm contains the following floating-point computation:

round(x*b - y*a)

where the results can be both positive and negative.

Currently I'm using 2 VMUL and 1 VSUB to compute in parallel (4 values per operation, using Q registers and 32-bit floats).

Is there a way to handle the rounding? If the results all had the same sign, I know I could simply add or subtract 0.5 before converting.


2 Answers

1
votes

First, NEON suffers from long latency, especially after float multiplications. Because of this, two vmuls and one vsub won't gain you much compared to VFP code.

Therefore, your code should look like this:

vmul.f32 result, x, b
vmls.f32 result, y, a

Those multiply-accumulate/subtract instructions are issued back-to-back with the previous multiply instruction without any latency (9 cycles saved in this case).
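For anyone using intrinsics rather than hand-written assembly, the same multiply followed by multiply-subtract can be sketched in C roughly as below. Treat it as a sketch under assumptions: the helper name mul_sub_f32 and its signature are made up for illustration, the scalar loop is a fallback added so the snippet also builds on non-NEON targets, and whether the compiler really emits vmul/vmls back-to-back is up to the compiler.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = x[i]*b[i] - y[i]*a[i] for n floats. */
void mul_sub_f32(const float *x, const float *b,
                 const float *y, const float *a,
                 float *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four lanes per iteration, using Q registers. */
    for (; i + 4 <= n; i += 4) {
        /* acc = x * b              (vmul.f32) */
        float32x4_t acc = vmulq_f32(vld1q_f32(x + i), vld1q_f32(b + i));
        /* acc = acc - y * a        (vmls.f32) */
        acc = vmlsq_f32(acc, vld1q_f32(y + i), vld1q_f32(a + i));
        vst1q_f32(out + i, acc);
    }
#endif
    /* Scalar tail, and the whole range when NEON is unavailable. */
    for (; i < n; ++i) {
        out[i] = x[i] * b[i] - y[i] * a[i];
    }
}
```

Calling it with n = 5 exercises both the vector loop and the scalar tail on a NEON target.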

Unfortunately, however, I don't understand your actual question. Why would someone want to round float values? Apparently you intend to extract the rounded integer part, and there are several ways to do this, but I can't tell you anything more because your question is, as always, too vague.

I've been following your questions in this forum for quite some time, and I simply cannot get rid of the feeling that you're lacking something very fundamental.

I suggest you read ARM's assembly reference guide PDF first.

2
votes

I have no knowledge of assembly, but using the NEON intrinsics in C (I mention their assembly equivalents to help you browse the documentation, even though I would not be able to use them myself), the algorithm for a round function could be:

// Prepare 3 vectors filled with all 0.5, all -0.5, and all 0
// Corresponding assembly instruction is VDUP
float32x4_t plus  = vdupq_n_f32(0.5f);
float32x4_t minus = vdupq_n_f32(-0.5f);
float32x4_t zero  = vdupq_n_f32(0.0f);

// Assuming the result of x*b - y*a is stored in the following vector:
float32x4_t xb_ya;

// Compare vector with 0
// Corresponding assembly instruction is VCGT
uint32x4_t more_than_zero = vcgtq_f32(xb_ya, zero);
// Each lane of the resulting vector is set to all 1-bits where the comparison
// is true, all 0-bits otherwise.

// Use bit select to choose whether to add or subtract 0.5
// Corresponding assembly instruction is VBSL; it behaves much like
// `more_than_zero ? plus : minus`, applied per lane.
float32x4_t to_add = vbslq_f32(more_than_zero, plus, minus);

// Add this vector to the vector to round
// Corresponding assembly instruction is VADD,
// but I guess you knew this one :D
float32x4_t rounded = vaddq_f32(xb_ya, to_add);

// Then convert to integers! The float-to-int conversion (VCVT) truncates
// toward zero, which is why the +/-0.5 offset gives round-to-nearest.
int32x4_t result = vcvtq_s32_f32(rounded);

I guess you'll be able to convert this to assembly (I can't, anyway).

Note that I have no idea whether this is really more efficient than standard, non-SIMD code!
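Putting the steps above together, a self-contained version might look like the sketch below. The function name round_to_i32 and its signature are hypothetical, the scalar loop is a fallback added so the snippet also builds without NEON, and it assumes every input is finite and fits in a 32-bit integer after rounding. It rounds halves away from zero, which is what the add-or-subtract-0.5 trick gives you.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: round n floats (half away from zero) into int32. */
void round_to_i32(const float *in, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t plus  = vdupq_n_f32(0.5f);
    const float32x4_t minus = vdupq_n_f32(-0.5f);
    const float32x4_t zero  = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(in + i);
        /* All 1-bits in lanes where v > 0, all 0-bits elsewhere (VCGT). */
        uint32x4_t positive = vcgtq_f32(v, zero);
        /* Per lane: positive ? +0.5 : -0.5 (VBSL). */
        float32x4_t offset = vbslq_f32(positive, plus, minus);
        /* Add the offset, then convert with truncation toward zero (VCVT). */
        vst1q_s32(out + i, vcvtq_s32_f32(vaddq_f32(v, offset)));
    }
#endif
    /* Scalar tail, and the whole range when NEON is unavailable. */
    for (; i < n; ++i) {
        float offset = in[i] > 0.0f ? 0.5f : -0.5f;
        out[i] = (int32_t)(in[i] + offset); /* cast truncates toward zero */
    }
}
```

One caveat with this trick: for inputs just below 0.5 in magnitude, the float addition itself can round up to 1.0 before the conversion, so the result may be off by one for those few values.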