2
votes

I've just started trying to optimise some Android code using NEON. I'm having a few issues, however. The main one is that I really can't work out how to do a quick 16-bit integer to float conversion.

I see it's possible to convert multiple 32-bit ints to floats in one SIMD instruction using vcvt.f32.s32. However, how do I convert a set of 4 S16s to 4 S32s? I assume it has something to do with the VUZP instruction, but I cannot figure out how...

Equally, I see that it's possible to use VCVT.F32.S16 to convert one 16-bit value to a float at a time, but while this is helpful it seems very wasteful not to be able to do it using SIMD.

I've written assembler on many different platforms over the years but I find the ARM documentation completely unfathomable for some reason.

As such any help would be HUGELY appreciated.

Also is there any way to get the throughput and latency figures for the NEON unit?

Thanks in advance!

3
Not that familiar with NEON, but can't you "widen" the 4 shorts to 4 ints and then convert? Looking at GCC's intrinsics, I think maybe vaddl.s16 with a zero second operand might do it. - user786653
@user786653: Hmmm, that might just do it actually :D - Goz
Yup, that seems to work. Can't believe I didn't notice that instruction. - Goz

3 Answers

4
votes

If no other computation is to be done along with the conversion from 16-bit integer to 32-bit integer, you can use int32x4_t = vmovl_s16(int16x4_t) for signed data, or uint32x4_t = vmovl_u16(uint16x4_t) for unsigned data.

If any simple addition, multiplication, etc. is being performed before the conversion, you can combine the two in a single widening instruction such as int32x4_t = vmull_s16(int16x4_t, int16x4_t) or int32x4_t = vaddl_s16(int16x4_t, int16x4_t) (the _u16 variants take uint16x4_t and return uint32x4_t), thus saving some cycles.
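As a rough illustration of the plain widen-then-convert path, here is a minimal sketch using the intrinsics from arm_neon.h. The function name s16_to_f32 is made up for this example, and the scalar loop is only there to handle leftover elements and to let the code build on non-NEON targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: convert n signed 16-bit values to float. */
void s16_to_f32(const int16_t *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int16x4_t   s16 = vld1_s16(src + i);   /* load 4 x int16           */
        int32x4_t   s32 = vmovl_s16(s16);      /* sign-extend to 4 x int32 */
        float32x4_t f32 = vcvtq_f32_s32(s32);  /* convert to 4 x float     */
        vst1q_f32(dst + i, f32);               /* store 4 x float          */
    }
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        dst[i] = (float)src[i];
}
```

Every int16 value is exactly representable as a float, so the NEON path and the scalar path should give identical results.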

2
votes

Elaborating a bit on my comment: you want to "widen" the four 16-bit integers to four 32-bit integers before converting to four 32-bit floats. Looking at the instruction set I don't think there are any faster conversion paths, but I could easily be wrong.

The direct method is to use vaddl.s16 with a second operand of four zeros, but unless you're only doing conversion you can often combine the conversion with a previous operation. E.g. if you're multiplying two int16x4 registers you can use vmull.s16 to get 32-bit output directly rather than first multiplying and widening later (provided you're not depending on any truncating behavior).
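To make the "combine with a previous operation" idea concrete, here is a small sketch in C intrinsics, assuming the preceding operation is an element-wise multiply. The name mul_s16_to_f32 is hypothetical, and the scalar loop covers the tail and non-NEON builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = (float)(a[i] * b[i]), with the product
 * computed at full 32-bit width so nothing is truncated. */
void mul_s16_to_f32(const int16_t *a, const int16_t *b, float *dst, size_t n)
{
    assert(a != NULL && b != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        /* vmull.s16: 4 x int16 times 4 x int16 gives 4 x int32 directly,
         * so no separate widening step is needed. */
        int32x4_t prod = vmull_s16(vld1_s16(a + i), vld1_s16(b + i));
        vst1q_f32(dst + i, vcvtq_f32_s32(prod));
    }
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        dst[i] = (float)((int32_t)a[i] * (int32_t)b[i]);
}
```

Note that a 32-bit product can need more than the 24 bits of precision a float carries, so large products get rounded during the conversion.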

1
votes

Why use vaddl, wasting cycles initializing a valuable register with 0?

vmovl.s16 q0, d1

Then convert q0 (vcvt.f32.s32 q0, q0).

That will do.

My question is :

  • Is it absolutely necessary to convert them to float? NEON is much faster doing integer operations than float. (both execution and pipeline) Therefore, fixed-point operations will be more appropriate in most cases thanks to the powerful long, wide, narrow models combined with arithmetic instructions and automatic round/saturation options.
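As one possible illustration of staying in fixed point, here is a sketch that scales int16 samples by a Q15 gain using a long multiply followed by a rounding, saturating narrow. The function name and the Q15 format are assumptions made for this example, not something from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = saturate16(round(src[i] * gain / 32768)),
 * where gain is a Q15 fixed-point factor (16384 means 0.5). */
void scale_q15(const int16_t *src, int16_t gain, int16_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x4_t g = vdup_n_s16(gain);
    for (; i + 4 <= n; i += 4) {
        int32x4_t wide = vmull_s16(vld1_s16(src + i), g); /* long multiply  */
        /* vqrshrn: rounding shift right by 15, saturate, narrow to int16. */
        vst1_s16(dst + i, vqrshrn_n_s32(wide, 15));
    }
#endif
    /* Scalar tail and fallback, written to mirror the NEON rounding.
     * Assumes >> on a negative int32_t is an arithmetic shift, which holds
     * for GCC and Clang but is implementation-defined in ISO C. */
    for (; i < n; ++i) {
        int32_t p = ((int32_t)src[i] * (int32_t)gain + (1 << 14)) >> 15;
        if (p > INT16_MAX) p = INT16_MAX;
        if (p < INT16_MIN) p = INT16_MIN;
        dst[i] = (int16_t)p;
    }
}
```

Whether this beats the float route depends on the core and on the rest of the pipeline, so it is worth measuring rather than assuming.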

PS: Strange, I find ARM's PDF documentation to be the best around.