3
votes

I recently used a board (LPCXpresso 5411x) for some computation, and we are trying to cut cycles as much as we can to reduce running time, so I needed to research how many cycles Cortex-M4 instructions cost. I've found several weird things that I couldn't explain with what I found on the internet.

I used DWT->CYCCNT to count cycles consumed by a function I want to test.

int start_cycle, end_cycle;

__asm volatile (
  "LDR %[s1], [%[a]], #0\n\t"
  :[s1] "=&r"(start_cycle): [a] "r"(&(DWT->CYCCNT)):);

AddrSumTest();
__asm volatile (
  "LDR %[s1], [%[a]], #0\n\t"
  :[s1] "=&r"(end_cycle): [a] "r"(&(DWT->CYCCNT)):);

printf("inside the func() cycles: %d\n",end_cycle - start_cycle);

Here is how my function is defined:

__attribute__( ( always_inline )) static inline void AddrSumTest(){
    uint32_t x, y, i, q;

    __asm volatile (
        "nop\n\t"
        :[x] "=r" (x), [y] "=r" (y), [i] "=r" (i), [q] "=r" (q):);
}
  • According to the Arm Infocenter, a MOV instruction should cost one cycle, but I've found that the following instructions cost 8 cycles (not 3, because extra cycles are needed to read DWT->CYCCNT):

  "nop\n\t"
  "MOV %[x], #2\n\t"
  "nop\n\t"

After adding another MOV instruction, the following instructions need 10 cycles (why not 9?):

  "nop\n\t"
  "MOV %[x], #2\n\t"
  "MOV %[y], #3\n\t"
  "nop\n\t"

The disassembly for the latter case is:

4000578:    f853 4b00   ldr.w   r4, [r3], #0
400057c:    bf00        nop
400057e:    f04f 0502   mov.w   r5, #2
4000582:    f04f 0603   mov.w   r6, #3
4000586:    bf00        nop
4000588:    f853 1b00   ldr.w   r1, [r3], #0
400058c:    4805        ldr r0, [pc, #20]   ;(40005a4<test_AddrSum+0x30>)
400058e:    1b09        subs    r1, r1, r4
4000590:    f000 f80e   bl  40005b0 <__printf_veneer>

The two ldr instructions read DWT->CYCCNT. It's also strange that this costs 10 cycles; my estimate is 2 (from ldr) + 4 = 6.

By the way, the board doesn't have any cache; the code is stored in SRAMX and the stack is in SRAM2.

Am I missing something, and is there any way I can figure out how every cycle is consumed? I'm also confused about data dependencies on the Cortex-M4.

1
Without any cache you probably have to pay an additional cost in cycles for the instruction fetches. Also note that the cycle times you've calculated would only include one of the two LDR instructions that read DWT->CYCCNT, not both. Presumably they read the cycle count at the same relative point in the execution of both instructions (e.g. at the start). For both to be included, the cycle count would have to be read at the start of the first LDR instruction and at the end of the second LDR instruction. - Ross Ridge
I agree with you, but did you mean an additional cost in cycles for every instruction fetch? - Stephen Yuan
Every instruction fetch, but I don't know how the Cortex-M4 does instruction fetches. It could be fetching 32-bit words on each fetch, for example, and so not every instruction would pay it. - Ross Ridge

1 Answer

2
votes

Taking a variation: I don't have that chip, but I have others, so in this case I'm using a TI Cortex-M4. The ST parts have a cache in front of the flash that I don't think you can turn off and that (as designed) affects performance.

00000082 <test>:
  82:   f3bf 8f4f   dsb sy
  86:   f3bf 8f6f   isb sy
  8a:   6802        ldr r2, [r0, #0]
  8c:   46c0        nop         ; (mov r8, r8)
  8e:   46c0        nop         ; (mov r8, r8)
  90:   46c0        nop         ; (mov r8, r8)
  92:   46c0        nop         ; (mov r8, r8)
  94:   46c0        nop         ; (mov r8, r8)
  96:   46c0        nop         ; (mov r8, r8)
  98:   f240 0102   movw    r1, #2
  9c:   f240 0103   movw    r1, #3
  a0:   46c0        nop         ; (mov r8, r8)
  a2:   46c0        nop         ; (mov r8, r8)
  a4:   46c0        nop         ; (mov r8, r8)
  a6:   46c0        nop         ; (mov r8, r8)
  a8:   46c0        nop         ; (mov r8, r8)
  aa:   46c0        nop         ; (mov r8, r8)
  ac:   46c0        nop         ; (mov r8, r8)
  ae:   6803        ldr r3, [r0, #0]
  b0:   1ad0        subs    r0, r2, r3
  b2:   4770        bx  lr

So without the second movw it takes 0x11 clocks in flash, and between 0x10 and 0x11 in ram depending on alignment. When the thumb2 instruction is aligned on a word boundary, it takes a clock longer than when unaligned.

using the thumb instruction 0x2102

00000000 20001016 00000010 
00000002 20001018 00000010 
00000004 2000101A 00000010 
00000006 2000101C 00000010 

using the thumb2 extension 0xf240, 0x0102

00000000 20001016 00000010 
00000002 20001018 00000011 
00000004 2000101A 00000010 
00000006 2000101C 00000011 

using the thumb2 extensions 0xf240, 0x0102, 0xf240, 0x0103

00000000 20001016 00000012 
00000002 20001018 00000013 
00000004 2000101A 00000012 
00000006 2000101C 00000013 
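
The pattern in those three tables can be checked mechanically. This is only a sketch, and it assumes the second column is the address where the (first) instruction under test ended up and the third column is the measured clock count. The struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One table row: assumed address of the instruction under test, and clocks. */
struct sample {
    uint32_t addr;
    uint32_t clocks;
};

/* Rows copied from the three tables above. */
static const struct sample thumb16[] = {
    {0x20001016u, 0x10u}, {0x20001018u, 0x10u},
    {0x2000101Au, 0x10u}, {0x2000101Cu, 0x10u},
};
static const struct sample thumb2_one[] = {
    {0x20001016u, 0x10u}, {0x20001018u, 0x11u},
    {0x2000101Au, 0x10u}, {0x2000101Cu, 0x11u},
};
static const struct sample thumb2_two[] = {
    {0x20001016u, 0x12u}, {0x20001018u, 0x13u},
    {0x2000101Au, 0x12u}, {0x2000101Cu, 0x13u},
};

/* An address is word aligned when its low two bits are zero. */
static inline int is_word_aligned(uint32_t addr)
{
    return (addr & 3u) == 0u;
}

/* Smallest clock count among word-aligned rows minus the smallest among
   rows that are only halfword aligned. Assumes both kinds are present. */
static inline int aligned_penalty(const struct sample *rows, size_t n)
{
    uint32_t best_aligned = UINT32_MAX;
    uint32_t best_unaligned = UINT32_MAX;

    for (size_t i = 0; i < n; i++) {
        if (is_word_aligned(rows[i].addr)) {
            if (rows[i].clocks < best_aligned)
                best_aligned = rows[i].clocks;
        } else {
            if (rows[i].clocks < best_unaligned)
                best_unaligned = rows[i].clocks;
        }
    }
    return (int)best_aligned - (int)best_unaligned;
}
```

Under that reading of the columns, the 16-bit encoding shows no penalty at any placement, while both 32-bit cases cost one extra clock when the placement is word aligned.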

And this is not really a surprise; it likely has to do with fetching. These microcontrollers are much simpler than the full-sized ARMs. The full-sized ones will fetch, say, 8 instructions per fetch, and where things lie in the fetch line can affect performance, more so with loops and where the branch lies in the fetch line (it doesn't matter if the cache is on or off). Branches also have branch predictors you can turn on and off, and these can vary in design.

This particular chip says that above 40 MHz it enables a prefetch that fetches one word, implying that below that it fetches one halfword (the bus is likely a word wide, so it reads the same address twice to get the two instructions there... why?)

On other chips (Cortex-Ms as well as others) you have to control the wait states on the flash. Sometimes the flash is half the speed of the RAM, and the same machine code runs faster from RAM even at low speeds; it only gets worse as you increase the clock and have to increase the number of wait states on the flash to keep its speed in check.

The ST family in particular has some marketing term for a prefetch cache thing they put in that you can't disable. You can do a dsb/isb just before the code under test and, for example, see the effects of wait states for a single pass, but if you're doing a test loop

test_loop: sub r3,#1
bne test_loop

and running it a lot of times, those few clocks at the beginning are reflected but small, just like when using a cache, but you should still see fetch line effects against a cache if the processor lets you see those.
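
To put rough numbers on how a one-time cost washes out over many passes, here is a sketch of the arithmetic. It assumes a fixed startup cost paid once and a steady per-iteration cost; the function name and any figures you feed it are illustrative, not measurements.

```c
#include <stdint.h>

/* Average clocks per iteration, in thousandths of a clock, when `startup`
   clocks are paid once and each of `iterations` passes costs `per_iter`
   clocks. `iterations` must be nonzero. */
static inline uint32_t avg_milliclocks(uint32_t startup, uint32_t per_iter,
                                       uint32_t iterations)
{
    uint64_t total = (uint64_t)startup + (uint64_t)per_iter * iterations;
    return (uint32_t)(total * 1000u / iterations);
}
```

With a made-up startup cost of 6 clocks and 3 clocks per pass, a single pass averages 9.000 clocks, while 1000 passes average 3.006, so the startup is still in there but barely visible.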

Some chips have a flash prefetch you can enable or disable, which particularly with loops can hurt performance rather than help if you align things just right such that the prefetcher is reading well past the end of the loop.

ARM IP stops at the ARM buses on the edge of the core (AXI, AMBA, AHB, APB, whatever). In general you might have ARM IP for an L2 cache (not in one of these microcontrollers), and you may buy some ARM IP to help you with their bus, but eventually the chip has chip-specific stuff in it, which ARM has nothing to do with and which is not consistent from chip vendor to chip vendor, in particular the flash and the SRAM interfaces.

First off, there is no reason to expect predictable results with a pipelined processor. As shown above, and as is really easy to show with a two-instruction loop, the same machine code can vary widely in performance due to alignment alone, but also due to factors that you control directly or indirectly: flash wait states and the speed of the clock relative to the flash. Say the boundary between N and N+1 wait states on our device is at 24 MHz: 24 MHz at N wait states is much faster than 24 MHz at N+1 wait states. 28 MHz (N+1 wait states) is faster than 24 MHz at N+1 wait states, and eventually the CPU clock may overcome the wait state, so you can find a CPU speed that outperforms 24 MHz at N+1 wait states. That is in terms of overall wall-clock performance, not CPU clocks counted; the counted CPU clocks, if affected by the flash wait states, will always be affected by them.
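
A rough worked example of that trade-off. The model is deliberately crude and is an assumption, not how any particular part behaves: every fetch from flash costs 1 + wait_states clocks and nothing else matters. The 24 MHz boundary comes from the paragraph above; the function names are hypothetical.

```c
#include <stdint.h>

/* CPU clocks for `fetches` flash-bound fetches under the crude model:
   each fetch costs one clock plus the configured wait states. */
static inline uint64_t cpu_clocks(uint64_t fetches, uint32_t wait_states)
{
    return fetches * (1u + (uint64_t)wait_states);
}

/* Wall-clock time in nanoseconds for the same work at `mhz` megahertz
   (one clock lasts 1000 / mhz nanoseconds). Result is truncated. */
static inline uint64_t wall_ns(uint64_t fetches, uint32_t wait_states,
                               uint32_t mhz)
{
    return cpu_clocks(fetches, wait_states) * 1000u / mhz;
}
```

Taking N = 0 and a million fetches, `wall_ns(1000000, 0, 24)` is about 41.7 ms, `wall_ns(1000000, 1, 24)` about 83.3 ms, and `wall_ns(1000000, 1, 28)` about 71.4 ms. In this model the clock would have to double before the extra wait state is paid back in wall-clock terms, and a real part may well demand further wait states before it gets there. The counted CPU clocks (`cpu_clocks`) do not depend on the frequency at all, only on the wait states.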

The SRAMs tend to not have wait states and run as fast as the CPU, but there are probably exceptions to that. No doubt the peripherals have limits; many of the vendors have rules about peripheral clocks (this one can't be above 32 MHz even though the part goes to 48, that kind of thing), so a benchmark that accesses a peripheral will take a different number of CPU clocks at different CPU/system speed settings.

You also have configurable options in the processor, basically compile-time options. The Cortex-M4 doesn't advertise this, but the Cortex-M0+ does: it can be configured for a 16- or 32-bit instruction fetch width. I don't have visibility into that source code, so it may be something that has to be chosen at compile time, or something where, if you choose, you can set up a control register and have it runtime configurable, or perhaps have logic that says if the PLL settings are such, then force one way, else the other, and so on. So even if you have two chips from different vendors with the same rev and model of CPU core, that doesn't mean they will behave the same. Not to mention the chip vendor has the source code and can make modifications.

So trying to predict cycle counts on a pipelined processor in a system that you don't have visibility into, is not going to happen. You will have times that you add an extra nop and it gets faster, times where you add one and it gets slower as one would expect and times where it doesn't change. And if a nop can do that then any other instruction can as well.
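
Since predicting is out, what is left is measuring on the actual part with the actual settings. A minimal sketch of a harness for the DWT->CYCCNT approach from the question follows. The register addresses and bits are the architectural ones for Cortex-M3/M4; the function names are made up, and `fake_counter`/`fake_work` exist only so the arithmetic can be tried on a host without hardware.

```c
#include <stdint.h>

/* Architectural Cortex-M3/M4 debug registers (only touch these on target). */
#define DEMCR      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

/* Turn the cycle counter on: TRCENA in DEMCR, then CYCCNTENA in DWT_CTRL. */
static inline void cyccnt_enable(void)
{
    DEMCR |= (1u << 24);
    DWT_CYCCNT = 0u;
    DWT_CTRL |= 1u;
}

/* Clocks between two reads of *counter around fn(). The unsigned
   subtraction stays correct if the counter wraps once in between. */
static inline uint32_t clocks_around(volatile uint32_t *counter,
                                     void (*fn)(void))
{
    uint32_t start = *counter;
    fn();
    uint32_t end = *counter;
    return end - start;
}

/* Smallest of `runs` measurements; returns UINT32_MAX if runs is 0. */
static inline uint32_t min_clocks(volatile uint32_t *counter,
                                  void (*fn)(void), unsigned runs)
{
    uint32_t best = UINT32_MAX;
    for (unsigned i = 0; i < runs; i++) {
        uint32_t c = clocks_around(counter, fn);
        if (c < best)
            best = c;
    }
    return best;
}

/* Baseline: measure this to find the cost of the harness itself. */
static inline void empty_work(void) { }

/* Host-only stand-ins: a counter that advances 7 "clocks" per call. */
static volatile uint32_t fake_counter;
static inline void fake_work(void) { fake_counter += 7u; }
```

On target, something like `min_clocks(&DWT_CYCCNT, work, 16) - min_clocks(&DWT_CYCCNT, empty_work, 16)` after `cyccnt_enable()` would give the clocks attributable to `work` with the read and call overhead subtracted. Calling through a function pointer adds branches that the always_inline version in the question does not have, so expect the absolute numbers to differ, and expect them to move with alignment as described above.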

Not to mention messing with the pipe itself: these Cortex-Ms have really short pipes, so we are told, so forcing a sequence of instructions with a lot of dependencies vs. a similar sequence without won't have as big of an effect.

Take the same machine code under test, run it on several Cortex-M4s from different vendors (or even Cortex-M3s and Cortex-M7s as well), from flash and RAM, with different settings, and there should be no surprise if the execution time in CPU ticks varies.