0
votes

I have 4 lanes d0, d1, d2, d3 and I would like to average four adjacent:

d0[0] = (d0[0] + d0[1] + d0[2] + d0[3]) / 4
d0[1] = (d0[4] + d0[5] + d0[6] + d0[7]) / 4
d0[2] = (d1[0] + d1[1] + d1[2] + d1[3]) / 4 
...

Is the following Neon code correct?

vpaddl.u8 d0, d0
vpaddl.u8 d1, d1

vpaddl.u8 d2, d2
vpaddl.u8 d3, d3

vpadd.u16 d0, d0, d2
vshrn.u16 d0, q0, #2

If yes, is there a faster way to do it?

EDIT 1

The code above was not correct. I came up with the following:

vpaddl.u8 d0, d0
vpaddl.u8 d1, d1
vpaddl.u8 d2, d2
vpaddl.u8 d3, d3
vuzp.u16 q0, q1
vadd.u16 q0, q0, q1
vshrn.u16 d0, q0, #2

which is working. This is very similar to the second suggestion of 'Notlikethat' accepted answer but in a less-optimized way.

1
What other operations are you doing on this data beforehand? In general, horizontal operations are best avoided where possible, so having the data arranged differently in the first place would be the best option - if it were viable to interleave adjacent data across d0-d3 with a vld4, you could then easily use normal parallel operations as intended.Notlikethat
I agree but unfortunately, no, I can't do a vld4 before in this case. I was also thinking doing zip or vext to rearrange data but I'm not sure I would manage to have fewer instructions and I'm limited in the number of available lanes. Note that my question still stands (!), is the code at least correct?gregoiregentil

1 Answers

3
votes

After a fair bit of playing around, I'd say that under the given constraints, your general approach is probably about as good as you're going to get. With more than 4 registers worth of data to work on at once, and/or the opportunity to absorb the cost of rearranging vectors elsewhere in the algorithm, the conclusion might well be different, but otherwise the viable options are very limited.

You are, however, forgetting d1 and d3 in your second round of reductions, and the first round can be simplified down to two instructions since unlike vpadd, vpaddl does have a Q-register form. Fixing those ends up with this:

vpaddl.u8 q0, q0
vpaddl.u8 q1, q1
vpadd.u16 d0, d0, d1
vpadd.u16 d1, d2, d3
vshrn.u16 d0, q0, #2

In terms of pure instruction count, it is possible to go lower by doing a partial transposition then accumulating two pairwise reductions directly, thus:

vuzp.16 q0, q1
vpaddl.u8 q0, q0
vpadal.u8 q0, q1
vshrn.u16 d0, q0, #2

but Q-form vuzp is really expensive on everything I can find instruction timings for, and the accumulate of vpadal isn't free either, so that's almost certainly going to end up performing worse overall than the more straightforward version.