I’m trying to understand STM8 pipelining so that I can predict how many cycles my code will need.

I have this example, where I toggle a GPIO pin every 4 cycles. If, and only if, the loop is aligned at a 4-byte boundary + 3, the pin stays high for 5 cycles (i.e. one more than it should). I wonder why?

// Switches port D2, 5 cycles high, 4 cycles low
void main(void)
{
    __asm
        bset 0x5011, #2 ; output mode
        bset 0x5012, #2 ; push-pull
        bset 0x5013, #2 ; fast switching

        jra _loop
    .bndry 4
        nop
        nop
        nop
    _loop:
        nop
        bset 0x500f, #2
        nop
        nop
        nop
        bres 0x500f, #2
        jra _loop
    __endasm;
}

A bit more context:

  • bset/bres are 4-byte instructions, nop is 1 byte.
  • The nop/bset/bres instructions take 1 cycle each.
  • The jra instruction takes two cycles. I think that in the first cycle the instruction cache is filled with the next 32-bit value, which in this case contains only the nop instruction, and that in the 2nd cycle the CPU is simply stalled while decoding the next instruction.

So in cycles:

  1. bres clears the pin
  2. jra, pipeline flush, nop fetch
  3. nop decode, bset fetch
  4. nop execute, bset decode, next nop fetch
  5. bset execute sets the pin
  6. nop, bres fetch
  7. nop
  8. nop, bres decode
  9. bres execute clears the pin

According to this, the pin should stay LOW for 4 cycles and HIGH for 4 cycles, but it’s staying HIGH for 5 cycles.

In any other alignment case, the pin is LOW/HIGH for 4 cycles as expected.

I think that if the pin stays high for an extra cycle, the execution pipeline must be stalled after the bset instruction (the nops that follow provide enough time for bres to be ready to execute immediately). But according to my understanding, the nop for step 6 would already have been fetched in step 4.

Any idea how this behavior can be explained? I couldn’t find any hints in the manual.

1 Answer


It is explained in section 5.4, which basically says that throughout the programming manual, "a simplified convention providing a good match with reality" will be used. In my experience, this simplified convention is indeed a good approximation for a longer sequence, but it is unusable for exact per-instruction timing, even if you're working at assembly level and controlling alignment. Take "SLA addr" as an example. It is documented to use 1 cycle. Put three of them in sequence to implement the C equivalent of "*(addr) <<= 3", and you'll clock up 5-6 cycles.

The actual cycles used for decoding and execution are undocumented. Apart from the obvious reasons, there is no comprehensive documentation about what causes pipeline stalls. I was able to get some insight into this by configuring TIM2 with a prescaler of /1 and a reload value of 0xFFFF while using ST-LINK/V2 to step through my code. You can then keep a watch on TIM2_CNTRL to see the cycles consumed (== the aggregate value of executing the previous and decoding the current instruction).
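To turn the watched counter values into per-step cycle counts, the readings only need to be subtracted modulo 2^16. This works because the reload value 0xFFFF makes the counter wrap like a plain 16-bit integer. The following is a minimal host-side sketch, not code that runs on the STM8. The helper names are hypothetical, and reading the high byte from TIM2_CNTRH alongside TIM2_CNTRL is an assumption that goes beyond what I described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine the two 8-bit counter halves as noted down from the debugger.
   Assumption: cntrh is the value of TIM2_CNTRH, cntrl the value of TIM2_CNTRL. */
static uint16_t tim2_count(uint8_t cntrh, uint8_t cntrl)
{
    return (uint16_t)(((uint16_t)cntrh << 8) | cntrl);
}

/* Cycles between two readings. With a /1 prescaler one count is one cycle,
   and unsigned 16-bit subtraction absorbs a single wrap from 0xFFFF to 0. */
static uint16_t cycles_elapsed(uint16_t before, uint16_t after)
{
    return (uint16_t)(after - before);
}

/* Fill out[i] with the cycles between samples[i] and samples[i + 1].
   The caller must provide room for n - 1 entries in out. */
static void step_deltas(const uint16_t *samples, size_t n, uint16_t *out)
{
    assert(samples != NULL && out != NULL);
    for (size_t i = 0; i + 1 < n; ++i)
        out[i] = cycles_elapsed(samples[i], samples[i + 1]);
}
```

Feeding it one reading per single-step gives one delta per instruction, which can then be compared against the documented cycle count.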

Things to keep an eye on are obviously instructions spanning 32-bit boundaries. There were also cases where loading instructions from the next 32-bit word caused an unexpected additional cycle in a sequence of NOPs, suggesting that any fetch (even if not necessary for the current or next instruction) costs 1 cycle. I've seen CALLs to targets aligned to 32-bit boundaries taking 4-7 cycles, suggesting that the CPU was still busy executing the previous instruction or was stalling the call for an unknown reason. Modifying the SP (push/pop or direct add/sub) seems to cause stalls under certain conditions.
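To check which instructions of the loop in the question span a 32-bit boundary for each possible alignment, the layout can be computed on a PC. This is only a sketch of the byte layout and says nothing about the actual pipeline behaviour. The bset/bres and nop sizes come from the question; the 2-byte size of jra (opcode plus 8-bit offset) and all function names are my assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes in bytes of the loop body: nop, bset, nop, nop, nop, bres, jra.
   Assumption: jra occupies 2 bytes. */
static const uint8_t loop_sizes[] = {1, 4, 1, 1, 1, 4, 2};
#define LOOP_LEN (sizeof loop_sizes / sizeof loop_sizes[0])

/* True if an instruction of `size` bytes starting at `addr`
   has its first and last byte in different 4-byte words. */
static bool spans_word(uint32_t addr, uint8_t size)
{
    return (addr / 4u) != ((addr + size - 1u) / 4u);
}

/* Bit i of the result is set if instruction i of the loop crosses a
   4-byte boundary when _loop starts start_offset bytes past a boundary. */
static unsigned spanning_mask(unsigned start_offset)
{
    assert(start_offset < 4u);
    uint32_t addr = start_offset;
    unsigned mask = 0;
    for (unsigned i = 0; i < LOOP_LEN; ++i) {
        if (spans_word(addr, loop_sizes[i]))
            mask |= 1u << i;
        addr += loop_sizes[i];
    }
    return mask;
}
```

Under these size assumptions, offset 3 is the only alignment where bset sits entirely within one 32-bit word while both bres and jra cross a boundary (mask 0x60). At offset 0 only bset crosses (0x02), and at offsets 1 and 2 bset and bres cross (0x22). Whether that layout difference is what produces the extra cycle is something I can't confirm.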

Any additional insight appreciated!