I'm playing around with an STM32F407 (Cortex-M4) and measuring the cycle count of a function that I implemented in assembly, by reading DWT_CYCCNT
directly before and after calling it from C. I'd like to understand the results that I get.
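The measurement code itself is not shown here. A minimal sketch of this kind of measurement might look like the following, assuming bare-metal register access without CMSIS. The register addresses are the architectural ARMv7-M ones; the helper names (`cyccnt_enable`, `cycles_elapsed`, `measure_cycles`, `empty_function`) are hypothetical and only for illustration.

```c
#include <stdint.h>

/* Architecturally defined ARMv7-M debug registers (valid on a Cortex-M4). */
#define DEMCR       ((volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL    ((volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT  ((volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit      */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enables the cycle counter */

/* Switch the cycle counter on. Only meaningful on the target itself. */
void cyccnt_enable(void)
{
    *DEMCR      |= DEMCR_TRCENA;
    *DWT_CYCCNT  = 0;
    *DWT_CTRL   |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across a single 32-bit wraparound. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Read the counter directly before and after calling fn.
 * The counter is passed as a pointer so the same code can be
 * exercised against an ordinary variable off-target. */
uint32_t measure_cycles(volatile uint32_t *counter, void (*fn)(void))
{
    uint32_t start = *counter;
    fn();
    uint32_t end = *counter;
    return cycles_elapsed(start, end);
}

/* Baseline: measuring this gives the call/return and read overhead. */
void empty_function(void)
{
}
```

On the target this would be used as `cyccnt_enable();` followed by `measure_cycles(DWT_CYCCNT, my_function)`. The result includes the call, the return, and the counter reads, so comparing against `measure_cycles(DWT_CYCCNT, empty_function)` separates that overhead from the function body.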
08000610 <my_function>:
8000610: f04f 20ff mov.w r0, #4278255360 ; 0xff00ff00
8000614: f04f 11ff mov.w r1, #16711935 ; 0xff00ff
8000618: ea81 0100 eor.w r1, r1, r0
800061c: ea81 0100 eor.w r1, r1, r0
8000620: ea81 0100 eor.w r1, r1, r0
8000624: ea81 0100 eor.w r1, r1, r0
8000628: 4770 bx lr
800062a: bf00 nop
Executing the above (including the function call) takes 21 cycles. When I add one eor instruction:
08000610 <my_function>:
8000610: f04f 20ff mov.w r0, #4278255360 ; 0xff00ff00
8000614: f04f 11ff mov.w r1, #16711935 ; 0xff00ff
8000618: ea81 0100 eor.w r1, r1, r0
800061c: ea81 0100 eor.w r1, r1, r0
8000620: ea81 0100 eor.w r1, r1, r0
8000624: ea81 0100 eor.w r1, r1, r0
8000628: ea81 0100 eor.w r1, r1, r0
800062c: 4770 bx lr
800062e: bf00 nop
This suddenly becomes 28 cycles.
Adding another eor does not change the cycle count (still 28). Adding one more increases the cycle count by 1 as expected (so 29).
Why?
- According to ARM, an eor should just be 1 cycle always.
- I don't see how the 3-stage pipeline could explain this behaviour.
- The instructions are all word-aligned, so no issue there.
- I suspect it's related to flash access, although I tried putting this code in IWRAM and executing from there, but that didn't change anything.
- Looking at the objdump of my binary, I can confirm that there's nothing wrong with my measurements.
- Finally, I experimented a bit with forcing the use of the 16-bit Thumb encoding, but that didn't help me in understanding what's happening.
Any ideas? :)
(This question is somewhat similar to #18960524, but without mul and load instructions that may mess things up.)