2
votes

I have a simple piece of assembly that loads 12 NEON quad registers and interleaves pairwise-add instructions with the loads (to exploit the dual-issue capability). I have verified the code here:

http://pulsar.webshaker.net/ccc/sample-d3a7fe78

As one can see, the code is taking around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I verified and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one. Why is that?

I have taken care of the following:

  1. The address is 16 byte aligned.
  2. Have provided the alignment hint in the instruction: vld1.64 {d0, d1}, [r0,:128]!
  3. Tried the preload instruction pld [r0, #192] at various places, but it seems to add cycles instead of reducing the latency.

Can someone tell me what I am doing wrong? Why this latency?

Other Details:

  • With reference to Cortex-A8
  • arm-2009q1 cross compiler tool chain
  • coding in assembly
Does this reflect reality better? pulsar.webshaker.net/ccc/beta-sample-d3a7fe78 (using the 'beta' simulator) – Aki Suihkonen
@AkiSuihkonen, how is that possible? VPADAL and VLD should be able to run in parallel, but it doesn't look like they do from the simulator link you gave. Also, why does NEON have to start so late? – nguns
Sorry, I couldn't find an easy copy-paste. However, the answer to your question/confusion is: CPUs cannot make timing guarantees on external memories, whether that is a cache (if it is not tightly integrated) or, worse, external memory. That's why people talk about DDR2, DDR3, etc.; they have different performance characteristics. You should read about your whole system at this stage to understand how much stall you can get from L1, L2, and RAM. – auselen
Just to add one more thing (I'll try to write a proper answer later if someone doesn't do it before me): the timing in the TRM is for "executing" / "issuing", so I believe it refers to putting your load request into the load/store queue. – auselen
The manual indeed claims that loading qX @128 is a single-cycle operation, but it has to presuppose something, e.g. that the address has been prefetched. – Aki Suihkonen

1 Answer

2
votes

Your code is executing much slower than expected because, as currently written, it causes a perfect storm of pipeline stalls. On a modern pipelined CPU, an instruction can execute in one cycle only under ideal conditions: it is not waiting on memory and has no register dependencies. The way you've written the code, you allow no time for each load to complete before the very next instruction consumes the loaded register, which forces the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:

    veor.u16 q12,q12,q12     @ clear accumulated sum
top_of_loop:
    vld1.u16 {q0,q1},[r0,:128]!
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0
    vpadal.u16 q12,q1
    vpadal.u16 q12,q2
    vpadal.u16 q12,q3
    vld1.u16 {q0,q1},[r0,:128]!
    vld1.u16 {q2,q3},[r0,:128]!
    vpadal.u16 q12,q0
    vpadal.u16 q12,q1
    vpadal.u16 q12,q2
    vpadal.u16 q12,q3
    subs r1,r1,#8
    bne top_of_loop

Experiment with different numbers of load instructions issued before the adds. The point is that you need to allow time for a load to complete before you use its target register.

Note: Using Q4-Q7 is risky because they're non-volatile (callee-saved) registers. On Android you will get random garbage appearing in these (especially Q4) if you don't preserve them.
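If you do need Q4-Q7, the AAPCS requires preserving D8-D15 (which alias Q4-Q7) across your function. A minimal sketch of the conventional prologue/epilogue:

    my_kernel:
        vpush {d8-d15}      @ save callee-saved NEON registers (q4-q7)
        @ ... kernel code may now freely use q4-q7 ...
        vpop  {d8-d15}      @ restore before returning
        bx lr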