ARM Neon Assembler - strange pipeline issue

Question

I am trying to increase the performance of a piece of code written in ARM Assembler using Neon instructions.

For testing and calculating I use this calculator: http://pulsar.webshaker.net/ccc/sample-706454b3

I noticed that at line "n.34-0 1c n0" the Neon unit suddenly seems to wait for 10 cycles. What could be the reason for that, or is it just a bug in the calculator?

I would also appreciate some general advice on how to improve performance in ARM/Neon Assembler.

Target is an ARM Cortex-A9. For compiling I use the newest android-ndk with inline Assembler. Thank you.

BitBank · Accepted Answer · 2012-03-15T16:45:38

The NEON unit has to wait at that instruction because it reads a register (D4) that was loaded by the previous NEON instruction (n.33-0 1c n0). Loads are not instantaneous: because of pipelining, the data becomes available only after a delay, even when it comes from the cache. Reorder both your ARM and NEON instructions so that a register is not used immediately after it is loaded; otherwise you end up with wasted cycles (pipeline stalls).
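To illustrate the reordering idea, here is a minimal sketch in C with NEON intrinsics, not the code from the question. The function name `add_f32_pipelined` and the element-wise add are hypothetical, chosen only to show the pattern: the loads for the next block are issued before the block loaded one iteration earlier is consumed, so no instruction uses a register right after loading it. A plain scalar fallback is included so the sketch also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = a[i] + b[i].
 * Assumption: n is a non-zero multiple of 4. */
void add_f32_pipelined(const float *a, const float *b, float *out, size_t n)
{
    assert(n >= 4 && n % 4 == 0);
#if HAVE_NEON
    /* Prologue: issue the first loads before the loop starts. */
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    size_t i = 0;
    for (; i + 4 < n; i += 4) {
        /* Load the NEXT block first ... */
        float32x4_t na = vld1q_f32(a + i + 4);
        float32x4_t nb = vld1q_f32(b + i + 4);
        /* ... then consume the block loaded one iteration earlier,
         * so the add does not directly follow the load it depends on. */
        vst1q_f32(out + i, vaddq_f32(va, vb));
        va = na;
        vb = nb;
    }
    /* Epilogue: the last block has no following load to overlap with. */
    vst1q_f32(out + i, vaddq_f32(va, vb));
#else
    /* Portable fallback for hosts without NEON: same result, no pipelining. */
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

With intrinsics the compiler is still free to reschedule instructions, so the source order is only a hint. In hand-written inline assembler the instructions are emitted in the order you write them, so the same separation of load and use has to be arranged by hand, and the cycle calculator can be used to check that the stall is gone.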
