3
votes

[A lot of text incoming, since I want to detail my question as best as I can.]

I'm in the process of optimizing hand-written ARM assembly code for a Cortex-M0. The board I'm using is the STMicro STM32F0Discovery, which has an STM32F051R8 controller. The controller is running at 48 MHz.

Unfortunately, I'm getting some pretty strange cycle counts when doing optimizations.

For example, adding a single nop to a loop that runs twice should add 2 cycles in total. Instead, doing so adds around 1800 extra cycles. When I then add a second nop (so 2 nops in total), the cycle count does increase by the expected 4 cycles.

I get similar strange results for the example piece of code below. The top excerpt computes c = 25 * a + 5 * b; the bottom excerpt computes c = 5 * (5 * a + b). The bottom one should therefore be faster, since it needs one mov fewer. However, changing this:

movs r4, #25
muls r3, r4, r3
add  r2, r3 

ldrb r3, [r6, #RoundStep]
movs r4, #5
muls r3, r4, r3

add  r2, r3

into this:

movs r4, #5
muls r3, r4, r3 

ldrb r5, [r6, #RoundStep]
add r3, r5
muls r3, r4, r3

add  r2, r3

does not speed things up by the expected 1 cycle; instead, it slows the code down by roughly 1000 cycles...

To count the cycles, I'm using the SysTick counter: it counts down from its maximum value, and an overflow counter is incremented in the overflow interrupt. The code I'm using for this is more or less the same as this excerpt from the ARM website, but rewritten for the Cortex-M0 that I'm using. My code is fast enough that an overflow interrupt never actually fires during a measurement.
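For reference, the measurement scaffolding looks roughly like the sketch below. It uses the standard CMSIS register and handler names (SysTick->LOAD, SysTick_Handler, etc.); the helper function names are mine, and my actual code differs slightly:

```c
#include <stdint.h>
#include "stm32f0xx.h"            /* CMSIS device header for the STM32F051 */

static volatile uint32_t overflows;  /* bumped on each 24-bit wrap */

void SysTick_Handler(void)           /* standard CMSIS handler name */
{
    overflows++;
}

static void cycle_counter_start(void)
{
    overflows = 0;
    SysTick->LOAD = 0x00FFFFFF;      /* maximum 24-bit reload value */
    SysTick->VAL  = 0;               /* writing clears the counter */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk   /* core clock, 48 MHz */
                  | SysTick_CTRL_TICKINT_Msk     /* interrupt on wrap */
                  | SysTick_CTRL_ENABLE_Msk;
}

static uint32_t cycle_counter_stop(void)
{
    uint32_t down = SysTick->VAL;    /* counts DOWN, so elapsed = LOAD - VAL */
    SysTick->CTRL = 0;
    return overflows * 0x01000000u + (0x00FFFFFF - down);
}
```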

Now, I was starting to think that the counter was giving me wrong values, so I also wrote some code for a TI Stellaris LaunchPad I had lying around. This is a Cortex-M4F running at 80 MHz. The code measures the number of cycles a certain pin is held high. Of course, the clocks of the M0 and the M4F aren't running in sync, so the reported cycle counts vary a bit. I "fix" this by taking a low-weighted exponential average of the measured cycle counts (avg = 0.995 * avg + 0.005 * curCycles) and repeating the measurement 10000 times.

The time measured by the M4F is the same as measured by the M0, so "unfortunately" it seems the SysTick counter is working just fine in the M0.

At first I thought these extra delays were caused by pipeline stalls, but on one hand the M0 seems to be too simple for that, and on the other I can't find any detailed info on the M0's pipeline, so I can't verify.

So, my question is: what is going on here? Why does adding a single nop add around 1800 cycles, while adding two nops only adds the expected 4? And how come removing an instruction makes my code execute slower?

1
You simply cannot count cycles on a modern pipelined core like this. You are also forgetting that the flash in that part is slow: it takes more than one clock to fetch one instruction. And I think those ST parts have some strange flash scheme to try to compensate, meaning the flash speed may be non-deterministic. - old_timer
Do you know where to find that info about non-deterministic flash speed in those ST parts? That would be extremely helpful! - AVH
I thought I saw some article or advertisement about one of their parts; I assume it is this one, the F4, or both. Even if the flash takes a fixed number of clocks per access, what you are trying to do is still not going to be deterministic. - old_timer
In this case the FLASH_ACR contains the latency bit(s), so you are running with one clock wait state... - old_timer

1 Answer

3
votes

The muls instruction can take multiple cycles in the ALU pipe. Your transformation of c = 25 * a + 5 * b into c = 5 * (5 * a + b) does need one mov less. However, the load/store stage of the pipeline overlaps with the ALU stage; these are often separate stages, so alongside a ldrb you can get a mov for free. Also, depending on the operand values, muls may execute faster; in particular, top bytes that are zero often result in a shorter multiply. Finally, there are far fewer data dependencies in the first version: instruction n has no registers in common with instruction n+1, which is a basic requirement for pipelining.

Compare,

ldrb r5, [r6, #RoundStep]  ; 2 cycles
add r3, r5                 ; must stall waiting for r5 to load (1 cycle)

with,

ldrb r3, [r6, #RoundStep]  ; 2 cycles
movs r4, #5                ; may run in parallel with above.

So even though you may add up the instruction count and end up with less code, a larger alternative can still run faster due to pipelining and instruction scheduling.

The second version may be faster if you can relocate the ldrb towards the beginning of the routine.
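For instance, something along these lines (a sketch only; it assumes r4 and r5 are otherwise free at this point):

```asm
movs r4, #5                ; constant ready before it is needed
ldrb r5, [r6, #RoundStep]  ; start the load as early as possible
muls r3, r4, r3            ; multiply overlaps with the load completing
add  r3, r5                ; r5 is available by now, no stall
muls r3, r4, r3
add  r2, r3
```

This computes the same c = 5 * (5 * a + b), but the load no longer sits immediately before the instruction that consumes its result.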