3
votes

I am trying to increase the performance of a piece of code written in ARM Assembler using Neon instructions.

For testing and calculating I use this calculator: http://pulsar.webshaker.net/ccc/sample-706454b3

I noticed that at line "n.34-0 1c n0" the NEON unit suddenly seems to have to wait(?) for 10 cycles. What could be the reason for that, or is it just a bug in the calculator?

I would also appreciate some general advice on how to improve performance in ARM/NEON assembler.

Target is an ARM Cortex-A9. For compiling I use the newest android-ndk with inline Assembler. Thank you.


3 Answers

3
votes

The NEON unit has to wait at that instruction because you're referencing a register (D4) which was loaded in the previous NEON instruction (n.33-0 1c n0). Loads are not instantaneous and due to the pipelining, there is a delay in the availability of the data even if it comes from the cache. You need to reorder both your ARM and NEON instructions to not try to use registers immediately after you load them or you will end up with wasted cycles (pipeline stalls).
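To illustrate the reordering idea in plain C (a minimal sketch, not the asker's code): the loop below issues the load for the next element before doing arithmetic on the current one, so no operation consumes a value that was loaded immediately before it. The function name `weighted_sum` and its parameters are made up for this example. A C compiler is free to reschedule this as it sees fit; in hand-written assembler you would do the same thing by placing each `vld1` several instructions ahead of the first use of its destination register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: multiply each byte by a weight and sum the results.
   Each load is issued one iteration before its value is consumed. */
uint32_t weighted_sum(const uint8_t *src, size_t n, uint32_t weight)
{
    assert(src != NULL || n == 0);

    if (n == 0)
        return 0;

    uint32_t acc = 0;
    uint32_t cur = src[0];            /* prologue: first load */

    for (size_t i = 1; i < n; ++i) {
        uint32_t next = src[i];       /* start the next load early */
        acc += cur * weight;          /* work on the value loaded one iteration ago */
        cur = next;
    }

    acc += cur * weight;              /* epilogue: last element */
    return acc;
}
```

The prologue/epilogue split is the usual price of this kind of software pipelining: one load happens before the loop and one computation after it.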

2
votes

You shouldn't access memory from the ARM side while NEON is doing its job. It brings NEON to a full stop.

Apparently you are attempting some kind of parallel processing, which is devastating for the reason above.

Besides, there are way too many ldrb's. Byte access on ARM is also almost a sin.

I suggest you completely rewrite your code in C first, with 32-bit-only memory accesses, then evaluate whether it is suited to NEON at all.
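As a rough sketch of what "32-bit-only memory accesses" could look like in C (an illustration under assumptions, not the asker's algorithm): instead of four byte loads, read one 32-bit word and pick the bytes apart with shifts and masks. The function name `sum_bytes_wordwise` is hypothetical, and the sketch assumes the length is a multiple of 4. `memcpy` is used for the load because it is well-defined for any alignment; compilers commonly turn it into a single word load, but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: sum all bytes of a buffer using one 32-bit
   load per four bytes instead of four separate byte loads. */
uint32_t sum_bytes_wordwise(const uint8_t *src, size_t n)
{
    assert(n % 4 == 0);               /* assumption: length is a multiple of 4 */

    uint32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);   /* one 32-bit read, alignment-safe */

        /* Extract the four bytes in registers; the sum does not depend
           on byte order, so this works on little- and big-endian targets. */
        acc += (w & 0xFFu)
             + ((w >> 8) & 0xFFu)
             + ((w >> 16) & 0xFFu)
             + (w >> 24);
    }
    return acc;
}
```

Once the code has this shape, it is easier to judge whether the inner work is regular enough to be worth moving to NEON.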

2
votes

In fact, this is a little more complex. BitBank is right: NEON has to wait for D4.

But you have to wait for 10 cycles because NEON has a load/store queue, and that queue has been filled with other instructions before the

vld1.64 d4, [r7, :64]

So when you need D4, you must wait for this instruction to execute, but before it can execute, all the previous load/store instructions pushed into the NEON load/store queue must execute first.