I've just started trying to optimise some Android code using NEON. I'm having a few issues, however. The main one is that I really can't work out how to do a quick 16-bit integer to float conversion.
I see it's possible to convert multiple 32-bit ints to float in one SIMD instruction using vcvt.f32.s32. However, how do I convert a set of four S16s to four S32s? I assume it has something to do with the VUZP instruction, but I cannot figure out how...
Equally, I see that it's possible to use VCVT.F32.S16 to convert one 16-bit value to a float at a time. While this is helpful, it seems very wasteful not to be able to do it using SIMD.
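To make the goal concrete, here is a minimal scalar sketch in C of the conversion I'd like to vectorise. The function name `s16_to_float` is just a placeholder for illustration, and I'm assuming a plain value conversion with no scaling.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: convert n signed 16-bit values to float, one at a time.
   The name s16_to_float is hypothetical; no scaling is applied (assumption). */
static void s16_to_float(const int16_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* int16_t is promoted (sign-extended) to int, then converted to float */
        dst[i] = (float)src[i];
    }
}
```

Ideally the NEON version would handle four of these elements per iteration instead of one.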
I've written assembler on many different platforms over the years, but I find the ARM documentation completely unfathomable for some reason.
As such any help would be HUGELY appreciated.
Also, is there any way to get the throughput and latency figures for the NEON unit?
Thanks in advance!