For academic purposes I want to try to write an ARM NEON optimization of the following algorithm, even only to test if it is possible to obtain any performance improvement or not. I think this is not a good candidate for SIMD optimization because results are merged togheter losing parallelization gains.
This is the algorithm:
const uchar* center = ...;
int t0, t1, val;
t0 = center[0]; t1 = center[1];
val = t0 < t1;
t0 = center[2]; t1 = center[3];
val |= (t0 < t1) << 1;
t0 = center[4]; t1 = center[5];
val |= (t0 < t1) << 2;
t0 = center[6]; t1 = center[7];
val |= (t0 < t1) << 3;
t0 = center[8]; t1 = center[9];
val |= (t0 < t1) << 4;
t0 = center[10]; t1 = center[11];
val |= (t0 < t1) << 5;
t0 = center[12]; t1 = center[13];
val |= (t0 < t1) << 6;
t0 = center[14]; t1 = center[15];
val |= (t0 < t1) << 7;
d[i] = (uchar)val;
This is what I thought in ARM assembly:
VLD2.8 {d0, d1} ["center" addr]
supposing 8 bit chars, this first operation should load all the t0 and t1 values alternatively in 2 registers.
VCLT.U8 d2, d0, d1
a single operation of "less then" for all the comparisons. NOTES: I've read that VCLT is possible only with a #0 constant as second operand, so this must be inverted in a >=. Reading ARM documentation i think the result of every 8 bit value will be "all 1" for true (11111111) or "all 0" for false (00000000).
VSHR.U8 d4, d2, #7
this right shift will delete 7 out of 8 values in the register 8-bit "cells" (mainly to delete 7 ones). I've used d4 because of the next step will be the first d register mapped in q2.
Now problems begin: shifting and ORs.
VSHLL.U8 q2[1], d4[1], 1
VSHLL.U8 q2[2], d4[2], 2
...
VSHLL.U8 q2[7], d4[7], 7
I can imagine only this way (if it's possible to use [offsets]) for left shifts. Q2 should be specified instead of d4 according to the documentation.
VORR(.U8) d4[0], d4[1], d4[0]
VORR(.U8) d4[0], d4[2], d4[0]
...
VORR(.U8) d4[0], d4[7], d4[0]
Last step should give the result.
VST1.8 d4[0], [d[i] addr]
Simple store of the result.
It is my first approach to ARM NEON, so probably many assumptions may be incorrect. Help me understand possible errors, and suggest a better solution if possible.
EDIT: This is the final working code after the suggested solutions:
__asm__ __volatile ("VLD2.8 {d0, d1}, [%[ordered_center]] \n\t"
"VCGT.U8 d2, d1, d0 \n\t"
"MOV r1, 0x01 \n\t"
"MOV r2, 0x0200 \n\t"
"ORR r2, r2, r1 \n\t"
"MOV r1, 0x10 \n\t"
"MOV r3, 0x2000 \n\t"
"ORR r3, r3, r1 \n\t"
"MOVT r2, 0x0804 \n\t"
"MOVT r3, 0x8040 \n\t"
"VMOV.32 d3[0], r2 \n\t"
"VMOV.32 d3[1], r3 \n\t"
"VAND d0, d2, d3 \n\t"
"VPADDL.U8 d0, d0 \n\t"
"VPADDL.U16 d0, d0 \n\t"
"VPADDL.U32 d0, d0 \n\t"
"VST1.8 d0[0], [%[desc]] \n\t"
:
: [ordered_center] "r" (ordered_center), [desc] "r" (&desc[i])
: "d0", "d1", "d2", "d3", "r1", "r2", "r3");