I have a simple piece of assembly code that loads 12 NEON quad registers, with a pairwise-add instruction interleaved with each load (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see there, the code takes around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle each. I checked and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one cycle. Why is that?
I have taken care of the following:
- The address is 16-byte aligned.
- I have provided the alignment hint in the instruction:
vld1.64 {d0, d1}, [r0,:128]!
- I tried adding the preload instruction
pld [r0, #192]
in places, but that seems to add cycles instead of actually reducing the latency.
Can someone tell me what I am doing wrong? Why is there this latency?
Other details:
- Target: Cortex-A8
- arm-2009q1 cross-compiler toolchain
- Coding in assembly