I have four registers d0, d1, d2, d3, each holding eight u8 lanes, and I would like to average every group of four adjacent lanes:
d0[0] = (d0[0] + d0[1] + d0[2] + d0[3]) / 4
d0[1] = (d0[4] + d0[5] + d0[6] + d0[7]) / 4
d0[2] = (d1[0] + d1[1] + d1[2] + d1[3]) / 4
...
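To pin down what is meant, here is a scalar C sketch of the intended result. It assumes d0..d3 hold 32 consecutive bytes and that the average truncates (as a plain shift right by 2 does) rather than rounds. The name avg4_ref is only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (illustrative name): dst[i] is the truncated average of
 * src[4*i] .. src[4*i + 3]. n is the number of input bytes and is assumed
 * to be a multiple of 4. */
void avg4_ref(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n / 4; ++i) {
        /* Four u8 values sum to at most 1020, so unsigned cannot overflow. */
        unsigned sum = (unsigned)src[4 * i] + src[4 * i + 1]
                     + src[4 * i + 2] + src[4 * i + 3];
        dst[i] = (uint8_t)(sum >> 2); /* divide by 4, truncating */
    }
}
```

With 32 input bytes this produces 8 output bytes, which is what should end up in d0.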
Is the following Neon code correct?
vpaddl.u8 d0, d0
vpaddl.u8 d1, d1
vpaddl.u8 d2, d2
vpaddl.u8 d3, d3
vpadd.u16 d0, d0, d2
vshrn.u16 d0, q0, #2
If yes, is there a faster way to do it?
EDIT 1
The code above was not correct. I came up with the following:
vpaddl.u8 d0, d0
vpaddl.u8 d1, d1
vpaddl.u8 d2, d2
vpaddl.u8 d3, d3
vuzp.u16 q0, q1
vadd.u16 q0, q0, q1
vshrn.u16 d0, q0, #2
which works. It is very similar to the second suggestion in Notlikethat's accepted answer, but less optimized.
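To check the lane bookkeeping of the EDIT 1 sequence off-target, it can be modelled in portable C. This is only a sketch: the model_* helpers and avg4_model are made-up names, and they mimic just the lane semantics of vpaddl.u8, vuzp.u16, vadd.u16 and vshrn, not real NEON execution.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models vpaddl.u8 Dd, Dm: adjacent u8 pairs are added into four u16 lanes. */
static void model_vpaddl_u8(const uint8_t in[8], uint16_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (uint16_t)(in[2 * i] + in[2 * i + 1]);
}

/* Models vuzp.u16 Qa, Qb: Qa receives the even-indexed lanes of the
 * 16-lane sequence (Qa, Qb), Qb receives the odd-indexed lanes. */
static void model_vuzp_u16(uint16_t a[8], uint16_t b[8])
{
    uint16_t all[16], even[8], odd[8];
    memcpy(all, a, 8 * sizeof(uint16_t));
    memcpy(all + 8, b, 8 * sizeof(uint16_t));
    for (int i = 0; i < 8; ++i) {
        even[i] = all[2 * i];
        odd[i] = all[2 * i + 1];
    }
    memcpy(a, even, sizeof even);
    memcpy(b, odd, sizeof odd);
}

/* Models the EDIT 1 sequence: src is d0..d3 as 32 bytes, dst is the final d0. */
void avg4_model(const uint8_t src[32], uint8_t dst[8])
{
    uint16_t q0[8], q1[8];
    model_vpaddl_u8(src, q0);          /* vpaddl.u8 d0, d0 */
    model_vpaddl_u8(src + 8, q0 + 4);  /* vpaddl.u8 d1, d1 */
    model_vpaddl_u8(src + 16, q1);     /* vpaddl.u8 d2, d2 */
    model_vpaddl_u8(src + 24, q1 + 4); /* vpaddl.u8 d3, d3 */
    model_vuzp_u16(q0, q1);            /* vuzp.u16 q0, q1 */
    for (int i = 0; i < 8; ++i) {
        uint16_t sum = (uint16_t)(q0[i] + q1[i]); /* vadd.u16 q0, q0, q1 */
        assert(sum <= 1020);                      /* 4 * 255, fits in u16 */
        dst[i] = (uint8_t)(sum >> 2);             /* vshrn.u16 d0, q0, #2 */
    }
}
```

After the unzip, lane i of q0 holds the sum of the first two bytes of group i and lane i of q1 holds the sum of the last two, so the vadd gives the full four-byte sums in order.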
vld4, you could then easily use normal parallel operations as intended. – Notlikethat
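One possible reading of the vld4 comment, sketched as a portable C model: assuming the data is loaded with a 4-way de-interleaving load (vld4.8 {d0, d1, d2, d3}, [ptr]), lane i of register k holds byte 4*i + k, so the four bytes of each group sit in the same lane of four different registers and plain element-wise adds are enough. The function names are hypothetical, and the instruction sequence named in the comments (vaddl.u8, vadd.u16, vshrn) is my assumption, not something stated in the comment.

```c
#include <assert.h>
#include <stdint.h>

/* Models vld4.8 {d0, d1, d2, d3}, [ptr]: memory byte 4*i + k lands in
 * lane i of register k (d[k][i] here). */
static void model_vld4_u8(const uint8_t src[32], uint8_t d[4][8])
{
    for (int i = 0; i < 8; ++i)
        for (int k = 0; k < 4; ++k)
            d[k][i] = src[4 * i + k];
}

/* Averages each group of four adjacent bytes using only lane-wise adds. */
void avg4_vld4_model(const uint8_t src[32], uint8_t dst[8])
{
    uint8_t d[4][8];
    model_vld4_u8(src, d);
    for (int i = 0; i < 8; ++i) {
        uint16_t lo = (uint16_t)(d[0][i] + d[1][i]); /* like vaddl.u8 q2, d0, d1 */
        uint16_t hi = (uint16_t)(d[2][i] + d[3][i]); /* like vaddl.u8 q3, d2, d3 */
        uint16_t sum = (uint16_t)(lo + hi);          /* like vadd.u16 q2, q2, q3 */
        assert(sum <= 1020);                         /* 4 * 255, fits in u16 */
        dst[i] = (uint8_t)(sum >> 2);                /* like vshrn #2 into d0 */
    }
}
```

The trade-off in this reading is that the shuffle work moves into the load, so no vuzp or pairwise adds are needed afterwards.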