2
votes

As far as I understand, Cortex-M0/M3 processors have a single memory space that holds both instructions and data, and access to it goes through the memory bus interface. If that is correct, then every clock cycle the processor must fetch a new instruction into the pipeline, which means the bus is always busy reading instructions, so how can data be read at the same time (for load word/store word instructions, for example)? Also, what is the latency of reading an instruction from memory? If it is not a single cycle, the processor must constantly stall until the next instruction is fetched, so how is that handled?

Thanks

3
Every clock cycle the pipe wants a new instruction, versus needs a new instruction. As with any pipelined processor, it stalls because we can't feed it; MCUs have less "stuff" around them than a bigger-sized part, so they are expected to stall a lot more. - old_timer
One memory space does not imply one bus and thus bus contention. And one "bus" in ARM's world is not one bus: before it leaves the ARM IP it becomes multiple busses, with multiple busses within each bus. It is then up to the chip vendor whether they keep those busses separate or merge some of them in their product. - old_timer

3 Answers

2
votes

Yes, this is how it happens: the processor stalls a lot. This goes on with big processors as well as small; it is difficult at best to keep a pipelined processor fed (some of these pipes are shallow on the Cortex-Ms, but they are pipelined nevertheless).

On many of the parts I have used (and I have touched most of the vendors), the flash runs at half the clock speed of the core, so even at zero wait states you can only get an instruction every other clock (on average, with overhead rolled in) if fetching a halfword at a time. If fetching a word at a time, which many of the cores offer, that is ideally two instructions per two clocks, or one per clock; with 32-bit Thumb-2 instructions you of course take the hit. ST definitely has a prefetcher/cache thing with a fancy marketing name that does a pretty good job. Others may offer something similar, or just rely on what ARM offers, which varies.
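For example (just a sketch, assuming an STM32F4-class part and its CMSIS device header names; other vendors, and even other ST families, use different registers and bits), enabling ST's prefetch/cache block and setting the flash wait states looks roughly like this:

    #include "stm32f4xx.h"  /* assumed CMSIS device header; adjust for your part */

    static void flash_speedup(void)
    {
        /* Prefetch plus instruction/data "cache" on, wait states to match SYSCLK.
           The exact latency value depends on your clock and supply voltage. */
        FLASH->ACR = FLASH_ACR_PRFTEN
                   | FLASH_ACR_ICEN
                   | FLASH_ACR_DCEN
                   | FLASH_ACR_LATENCY_5WS;
    }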

The different Cortex-Ms have different mixtures of busses. I hate the von Neumann/Harvard references, as there is little practical use for an actual Harvard architecture, thus the "modified" adjective, which means they can do anything, and which tries to attract folks taught in school that Harvard means performance. The busses can have multiple transactions in flight, and there are several separate busses, as is somewhat obvious when you go in and release clocks for a peripheral: APB1 clock control, AHB2 clock control, etc., for peripherals, flash, and so on (see the sketch below). But we can run code from SRAM, so it's not Harvard. Forget the Harvard and von Neumann terms and just focus on the actual implementation.
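You can see that bus split in vendor code without reading any bus spec. A minimal sketch, again assuming an STM32F4-class part and its CMSIS names: each peripheral hangs off a particular AHB or APB bus, and its clock is enabled in the matching RCC register for that bus:

    #include "stm32f4xx.h"  /* assumed CMSIS device header */

    static void enable_clocks(void)
    {
        RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;   /* GPIOA sits on AHB1 */
        RCC->APB1ENR |= RCC_APB1ENR_USART2EN;  /* USART2 sits on APB1 */
    }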

The bus documentation is as readily available as the core documentation. If you buy the right FPGA board you can request a free eval of a core, with which you can get an up-close and personal view of how it really works.

At the end of the day there is some parallelism, but on many chips the flash is half speed, so if you are not fetching two instructions per access or using some other solution, you are barely keeping up and will stall often if there are other accesses on the same bus. Likewise, on many of these chips the peripherals can't run as fast as the core, so that alone incurs a stall; and even if the peripheral runs on the same clock, that doesn't mean it turns around a CSR or data access as fast as SRAM, so you incur a stall there too.

There is no reason to assume you will get one-instruction-per-clock performance out of these parts, any more than out of a full-sized ARM or x86 or anything else.

While there are some important details that are not documented and are only seen when you get the core, there is documentation on each core and bus that gives a rough idea of how to tune your code to perform better, or to tune your expectations of how it will really perform. I know I have demonstrated this here and elsewhere: it is pretty easy, even on an ST part, to see a performance difference between flash and SRAM, and to see that it takes more clocks than instructions to run a benchmark.
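One way to see this for yourself on a Cortex-M3/M4 (not the M0, which lacks this unit) is the DWT cycle counter from the CMSIS headers. A rough sketch; the loop body and iteration count are arbitrary placeholders, and the device header name is again assumed:

    #include <stdint.h>
    #include "stm32f4xx.h"   /* any CMSIS device header pulls in the core DWT defs */

    uint32_t bench_cycles(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable trace/DWT unit */
        DWT->CYCCNT = 0;
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle counter   */

        uint32_t start = DWT->CYCCNT;
        for (volatile uint32_t i = 0; i < 1000; i++)     /* placeholder work loop */
            ;
        return DWT->CYCCNT - start;  /* link/run the same code from flash vs SRAM
                                        and compare the returned cycle counts     */
    }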

Your question is too broad in a few ways. The Cortex-M0 and M3 are quite different: the M3 was the first one out and is dripping with features, while the M0 was tuned for size, has less stuff in general, and is not necessarily meant to compete in this way. Then how long the latency is, etc., is strictly up to the chip company and the family within that chip company, so that question is extremely broad across all the Cortex-M products out there, with dozens of different answers. ARM makes cores, not chips; the chip vendors make chips, buy IP from various places, and make some of their own, and some small part of that chip might be IP they buy from a processor vendor.

0
votes

What you've described is known as the "von Neumann bottleneck", and in machines with a pure von Neumann architecture, with shared data and program memory, accesses are usually interleaved. However, you might like to check out the "modified Harvard architecture", because that's basically what is in use here. The backing store is shared as in a von Neumann machine, but the instruction and data fetch paths are separate as in a Harvard machine, and crucially they have separate caches. So if an instruction or data fetch hits in the cache, no memory fetch takes place and there is no bottleneck.

The second part of your question doesn't make a great deal of sense I'm afraid, because it is meaningless to talk about instruction fetch times in terms of instruction cycles. By definition, if an instruction fetch is delayed for some reason, the execution of that instruction (and subsequent instructions) must be delayed. Typically this is done by inserting NOPs into the pipeline until the next instruction is ready (known as "bubbling" the pipeline).

0
votes

re: part 2: Instruction fetch can be pipelined to hide some / all of the fetch latency. Cortex-M3 has a prefetch unit with a 3-word FIFO. https://developer.arm.com/documentation/ddi0337/e/introduction/prefetch-unit (This can hold up to six 16-bit Thumb instructions.)

This buffer can also supply instructions while a data load/store is happening, in a configuration where those compete with each other (no Harvard-style split bus, and no data or instruction cache).

This prefetch is of course speculative; it is discarded on taken branches. (The core is simple and small enough that it is not worth doing branch prediction to try to fetch from the right place before decode even knows the upcoming instruction stream contains a branch.)
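A toy illustration of why that matters for tuning (a hypothetical micro-benchmark; numbers vary by chip): the taken branch at the bottom of a rolled loop throws away whatever the FIFO had prefetched past it, so an unrolled copy loop usually costs fewer cycles per element when running from wait-stated flash. Measure with a cycle counter on your own part to confirm.

    #include <stdint.h>

    void copy_rolled(uint32_t *dst, const uint32_t *src, unsigned n)
    {
        for (unsigned i = 0; i < n; i++)
            dst[i] = src[i];                       /* one taken branch per element */
    }

    void copy_unrolled4(uint32_t *dst, const uint32_t *src, unsigned n)
    {
        unsigned i;
        for (i = 0; i + 4 <= n; i += 4) {          /* one taken branch per 4 elements */
            dst[i]     = src[i];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }
        for (; i < n; i++)                         /* handle the tail elements */
            dst[i] = src[i];
    }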