Understanding cycle counts on Cortex M4

Question

I'm playing around with an STM32F407 with a Cortex M4 and I'm measuring cycle counts of a function by reading DWT_CYCCNT directly before and after calling a function (in C) that I implemented in assembly. I'd like to understand the results that I get.
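A measurement of this kind can be sketched roughly as follows. This is a minimal sketch, not the exact code behind the numbers below: the register addresses and bit positions are the architectural ones from the ARMv7-M Architecture Reference Manual, while the helper names (`cyccnt_enable`, `cyccnt_delta`, `measure_cycles`) are hypothetical.

```c
#include <stdint.h>

/* Core debug registers (addresses per the ARMv7-M Architecture Reference Manual). */
#define DEMCR      (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA       (1u << 24) /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA (1u << 0)  /* starts the cycle counter */

/* Hypothetical helper: turn the cycle counter on and reset it. */
static inline void cyccnt_enable(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0u;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction gives the right answer even if the
 * 32-bit counter wrapped once between the two reads. */
static inline uint32_t cyccnt_delta(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Hypothetical helper: read the counter directly before and after calling fn.
 * The result includes the call, the return and the counter reads themselves. */
static inline uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t before = DWT_CYCCNT;
    fn();
    uint32_t after = DWT_CYCCNT;
    return cyccnt_delta(before, after);
}
```

On the target this would be used as `cyccnt_enable();` once at startup, then `measure_cycles(my_function)` per measurement.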

08000610 <my_function>:
 8000610:       f04f 20ff       mov.w   r0, #4278255360 ; 0xff00ff00
 8000614:       f04f 11ff       mov.w   r1, #16711935   ; 0xff00ff
 8000618:       ea81 0100       eor.w   r1, r1, r0
 800061c:       ea81 0100       eor.w   r1, r1, r0
 8000620:       ea81 0100       eor.w   r1, r1, r0
 8000624:       ea81 0100       eor.w   r1, r1, r0
 8000628:       4770            bx      lr
 800062a:       bf00            nop

Executing the above (including the function call) takes 21 cycles. When I add one eor instruction:

08000610 <my_function>:
 8000610:       f04f 20ff       mov.w   r0, #4278255360 ; 0xff00ff00
 8000614:       f04f 11ff       mov.w   r1, #16711935   ; 0xff00ff
 8000618:       ea81 0100       eor.w   r1, r1, r0
 800061c:       ea81 0100       eor.w   r1, r1, r0
 8000620:       ea81 0100       eor.w   r1, r1, r0
 8000624:       ea81 0100       eor.w   r1, r1, r0
 8000628:       ea81 0100       eor.w   r1, r1, r0
 800062c:       4770            bx      lr
 800062e:       bf00            nop

This suddenly becomes 28 cycles.

Adding another eor does not change the cycle count (still 28). Adding one more increases the cycle counter by 1 as expected (so 29).

Why?

According to ARM, an eor should just be 1 cycle always.
I don't see how the 3-stage pipeline could explain this behaviour.
The instructions are all word-aligned, so no issue there.
I suspect it's related to flash access, although I tried putting this code in IWRAM and executing from there, but that didn't change anything.
Looking at the objdump of my binary, I can confirm that there's nothing wrong with my measurements.
Finally, I experimented a bit with forcing the use of the 16-bit Thumb encoding, but that didn't help me in understanding what's happening.

Any ideas? :)

(This question is somewhat similar to #18960524, but without mul and load instructions that may mess things up.)

@dwelch As far as I know this platform does not have any caches and just fetches one instruction at a time. Thanks for the ideas though. — Bla Blaat
The Cortex-M4 TRM says to see the ARM for information on the DWT, but the ARMv7-M ARM does not have these registers documented. Where do we find those docs? — old_timer
And how is this register different from using the SysTick timer? — old_timer
@dwelch They are definitely documented in the ARMv7-M Architecture Reference Manual. See for instance everything around table C1-25 on page C1-49 (593/716). I don't know the differences in detail. — Bla Blaat
I have a fairly recently downloaded 0403E.b; page 593 is B1.5.7, page 716 is B4.6.5. But I see the problem: near my MCU code I have a downloaded doc set for that MCU plus these ARM docs, and these are old docs. Elsewhere I have a download spot for ARM docs in general. The recently downloaded one does have a hit for searches on DWT and CYCCNT: C1-779, C1.8. Thanks! Interestingly, on my STM32F411 DWT_CTRL shows 0x40000000 (4 counters) and does not have NOCYCCNT set, but when I enable it, it doesn't take and doesn't run, so I used the SysTick counter instead for that MCU. — old_timer

Notlikethat · Accepted Answer · 2016-01-11T22:32:25

The core doesn't have a cache*, but the system certainly does - namely ST's "ART Accelerator".

As explained in section 3.5.2 of the TRM, this thing sits in the bus path making full-width (128-bit) fetches from the flash, then feeding those instructions to the core's ICode interface as it requests them.

Section 3.5.1 documents the number of flash wait states for each clock speed and voltage configuration, which for an STM32F407 means a worst case of up to 7 cycles. I'm going to guess from the nature of the question that you probably haven't enabled the accelerator's prefetch or instruction cache functionality, which means that for every 16 bytes' worth of instructions you're going to stall for n cycles (n being the configured number of wait states) while the next chunk is dragged in from flash.

The maths gets rather more awkward than I feel like trying to work out right now, but suffice to say that those 21 cycles are some overlapping combination of at least 7 execution cycles, 2 pipeline refills (1-3 cycles each for the call and return) and at least 2*n wait states to fetch at least 2 blocks from flash.

Now, a salient thing to note is that the first function is 28 bytes long, whilst the second is 32 - i.e. exactly two 16-byte chunks. Second fact of note: the M4's ICode interface only performs 32-bit reads, from which it then feeds the fetch stage of the pipeline (I assume it simply twiddles its thumbs for a cycle when the pipeline only consumes the first halfword). I'm pretty confident that what you're seeing in the second example is an unpleasant interaction between the two - with some educated guessing I imagine this set of circumstances:

1. As the fetch stage of the pipeline is pulling in the instruction at 0x800062c, the ICode interface is starting the bus request for the next word at 0x8000630.
2. As the pipeline decodes the bx lr and fetches 0x800062e from the same instruction word, the ICode interface takes a breather, but the accelerator is now waiting on the flash to deliver the read of 0x8000630-0x800063f.
3. The branch executes, and the ICode interface now has to sit around and wait for n-1 cycles for the accelerator to finish making a read which is now just going to be thrown away, before it can request whatever address lr held (then wait yet another n cycles to actually get it).
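The address arithmetic behind that guess can be checked with a few lines of C. This is only a sketch of the alignment reasoning, assuming 16-byte (128-bit) flash lines and 32-bit ICode fetches as described above; the helper names are hypothetical and it does not model the accelerator's actual timing.

```c
#include <stdint.h>

#define FLASH_LINE_BYTES 16u /* one 128-bit flash read */

/* Start address of the 16-byte flash line containing addr. */
static inline uint32_t flash_line_base(uint32_t addr)
{
    return addr & ~(FLASH_LINE_BYTES - 1u);
}

/* Number of flash lines touched by the byte range [start, end). */
static inline uint32_t flash_lines_spanned(uint32_t start, uint32_t end)
{
    return (flash_line_base(end - 1u) - flash_line_base(start))
           / FLASH_LINE_BYTES + 1u;
}

/* Nonzero if the 32-bit word after the one containing addr
 * lives in a different flash line, i.e. fetching it sequentially
 * would trigger a new flash read. */
static inline int next_word_crosses_line(uint32_t addr)
{
    uint32_t word = addr & ~3u;
    return flash_line_base(word + 4u) != flash_line_base(word);
}
```

Both versions of the function span two flash lines, but only in the second one does the word following the `bx lr` (at 0x800062c) fall into a fresh line, which is the situation the three steps above describe.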

Looking at FLASH_ACR should give a clearer idea of exactly what your configuration is, if you really want to try accounting for every cycle. Unless you clock the whole thing right down to the zero-wait-state configuration that ARM's core timings assume (note the first paragraph), you're going to have to consider more than just the core. More generally, I'd suggest that "programming a microcontroller without thoroughly studying the vendor's documentation" is right up there with "simply walking into Mordor" ;)
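As a rough illustration of what to look for in FLASH_ACR, here is a small decoder sketch. The bit positions (LATENCY in bits 2:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10) are my reading of ST's reference manual for the STM32F405/407 and should be checked against the manual for your exact part; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a FLASH_ACR value (hypothetical helper type). */
struct flash_acr_config {
    unsigned wait_states;  /* LATENCY field, bits 2:0 (assumed F405/407 layout) */
    bool prefetch_enabled; /* PRFTEN, bit 8 */
    bool icache_enabled;   /* ICEN, bit 9 */
    bool dcache_enabled;   /* DCEN, bit 10 */
};

/* Pure decode of a raw register value; on the target the value would
 * come from reading FLASH_ACR (0x40023C00 on the STM32F4). */
static inline struct flash_acr_config decode_flash_acr(uint32_t acr)
{
    struct flash_acr_config cfg;
    cfg.wait_states      = acr & 0x7u;
    cfg.prefetch_enabled = ((acr >> 8) & 1u) != 0u;
    cfg.icache_enabled   = ((acr >> 9) & 1u) != 0u;
    cfg.dcache_enabled   = ((acr >> 10) & 1u) != 0u;
    return cfg;
}
```

A nonzero `wait_states` with prefetch and instruction cache both disabled would match the configuration guessed at above.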

* Cortex-M7 is the first of ARM's M-class cores to actually have its own internal caches.
