I am trying to improve the performance of a piece of code written in ARM assembly using NEON instructions.
For testing and cycle counting I use this calculator: http://pulsar.webshaker.net/ccc/sample-706454b3
I noticed that at line "n.34-0 1c n0" the NEON unit suddenly seems to have to wait(?) for 10 cycles. What could be the reason for that, or is it just a bug in the calculator?
I would also appreciate some general advice on how to improve performance in ARM/NEON assembly.
The target is an ARM Cortex-A9. I compile with the newest Android NDK, using inline assembly. Thank you.