[A lot of text incoming, since I want to detail my question as best as I can.]
I'm in the process of optimizing hand-written ARM assembly code for a Cortex-M0. The board I'm using is the STMicro STM32F0Discovery, which has an STM32F051R8 controller. The controller is running at 48 MHz.
Unfortunately, I'm getting some pretty strange cycle counts when doing optimizations.
For example, adding a single nop to a loop that runs twice should add 2 cycles in total. However, doing so adds around 1800 extra cycles. When I add a second nop (so 2 nops in total), the cycle count increases by just the expected 4 cycles.
I get similarly strange results for the example code below. The first excerpt computes c = 25 * a + 5 * b; the second computes the equivalent c = 5 * (5 * a + b). The second one should be faster, since it needs one fewer movs. However, changing this:

    movs r4, #25
    muls r3, r4, r3
    add r2, r3
    ldrb r3, [r6, #RoundStep]
    movs r4, #5
    muls r3, r4, r3
    add r2, r3

into this:

    movs r4, #5
    muls r3, r4, r3
    ldrb r5, [r6, #RoundStep]
    add r3, r5
    muls r3, r4, r3
    add r2, r3

does not save the expected 1 cycle. Instead, the code takes roughly 1000 cycles longer...
To count the cycles, I'm using the SysTick counter, counting down from its maximum value and incrementing an overflow counter in the overflow interrupt. The code I'm using for this is more or less the same as this excerpt from the ARM website, but rewritten for the Cortex-M0. My code is fast enough that an overflow interrupt never happens during a measurement.
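For reference, the arithmetic behind that measurement can be sketched roughly as follows. This is not my actual code, only a minimal illustration that assumes the 24-bit SysTick down-counter is reloaded with its maximum value; the helper name `elapsed_cycles` and its parameters are made up for this example.

```c
#include <stdint.h>

/* SysTick is a 24-bit down-counter; with the reload value at its maximum
   (0xFFFFFF) it wraps once every 2^24 cycles. */
#define SYSTICK_PERIOD 0x1000000u

/* Hypothetical helper: cycles elapsed between two raw counter reads.
   start     - counter value read before the measured code
   end       - counter value read after the measured code
   overflows - number of wraps counted by the overflow interrupt in between */
static uint64_t elapsed_cycles(uint32_t start, uint32_t end, uint32_t overflows)
{
    /* The counter counts down, so without a wrap start >= end and the
       difference is start - end. Each wrap adds one full period. */
    return (uint64_t)overflows * SYSTICK_PERIOD + start - end;
}
```

With no overflow, the result is simply `start - end`, which is the case for all of my measurements.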
I was starting to think that the counter was giving me wrong values, so I also wrote some code for a TI Stellaris LaunchPad I had lying around. This is a Cortex-M4F running at 80 MHz. The code measures the number of cycles a certain pin is held high. Of course, the clocks of the M0 and the M4F aren't running in sync, so the reported cycle counts vary a bit, which I "fix" by taking an exponential moving average that gives each new sample a very low weight (avg = 0.995 * avg + 0.005 * curCycles) and repeating the measurement 10000 times.
The time measured by the M4F is the same as measured by the M0, so "unfortunately" it seems the SysTick counter is working just fine in the M0.
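The averaging on the M4F side amounts to something like the sketch below. The update formula is the one given above; the function names and the choice to seed the average with the first sample are assumptions made for this illustration.

```c
/* One smoothing step: avg = 0.995 * avg + 0.005 * curCycles */
static double update_average(double avg, double cur_cycles)
{
    return 0.995 * avg + 0.005 * cur_cycles;
}

/* Hypothetical driver: fold a series of measured cycle counts into one
   smoothed value. Seeding with the first sample is an assumption. */
static double average_cycles(const double *samples, int count)
{
    double avg = samples[0];
    for (int i = 1; i < count; i++) {
        avg = update_average(avg, samples[i]);
    }
    return avg;
}
```

With a weight of 0.005 per sample, the influence of the starting value drops below 1% after about 1000 samples, so 10000 repetitions are more than enough for the average to settle.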
At first I thought these extra delays were caused by pipeline stalls, but on one hand the M0 seems to be too simple for that, and on the other I can't find any detailed info on the M0's pipeline, so I can't verify.
So, my question is: what is going on here? Why does adding a single nop make my function take roughly 1000 extra cycles per loop iteration, while two nops add only the expected 2 cycles per iteration? And how can removing an instruction make my code execute slower?