Yes, this is how it happens: the processor stalls a lot. This goes on with big processors as well as small ones; it is difficult at best to keep a pipelined processor fed (some of the pipes on the cortex-ms are shallow, but they are pipelined nevertheless).
On many of the parts I have used, and I have touched most of the vendors, the flash runs at half the clock speed of the core. So even at zero wait states you can only get an instruction every other clock (on average, naturally, with overhead rolled in) if fetching a halfword at a time. If fetching a word at a time, which many of the cores offer, then that is ideally two instructions per two clocks, or one per clock. With the 32-bit thumb2 instructions you of course take the hit. ST definitely has a prefetcher/cacher thing with a fancy marketing name that does a pretty good job. Others may offer that as well, or just rely on what ARM offers, which varies.
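A rough sketch of that arithmetic, to make the numbers concrete. The function name and parameters are made up for illustration, and the model is deliberately naive: it assumes a single-issue core, straight-line code with no branches, no prefetcher or cache, and that each wait state costs one extra flash clock. Real parts will differ.

```python
from fractions import Fraction

def peak_instructions_per_clock(fetch_bits, insn_bits, flash_divider=2, wait_states=0):
    """Upper bound on instructions per core clock from flash bandwidth alone.

    fetch_bits:    width of one flash read (16 = halfword, 32 = word)
    insn_bits:     instruction size (16 = thumb, 32 = 32-bit thumb2 encoding)
    flash_divider: core clocks per flash clock (2 = flash at half core speed)
    wait_states:   extra flash clocks per read (assumption of this model)
    """
    # Core clocks spent on one flash read.
    core_clocks_per_fetch = flash_divider * (1 + wait_states)
    # Instructions delivered by one flash read (may be a fraction).
    insns_per_fetch = Fraction(fetch_bits, insn_bits)
    # Assumed single-issue core: never more than one instruction per clock.
    return min(Fraction(1), insns_per_fetch / core_clocks_per_fetch)

print(peak_instructions_per_clock(16, 16))  # 1/2: halfword fetch, 16-bit instructions
print(peak_instructions_per_clock(32, 16))  # 1:   word fetch, 16-bit instructions
print(peak_instructions_per_clock(32, 32))  # 1/2: word fetch, 32-bit instructions
```

These are best-case ceilings; anything else competing for the same bus only pushes the real number down.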
The different cortex-ms have different mixtures of busses. I hate the von Neumann/Harvard references, as there is little practical use for an actual Harvard architecture; thus the "modified" adjective, which means they can do anything and still try to attract folks who were taught in school that Harvard means performance. The busses can have multiple transactions in flight, and there are a number of different busses, which is somewhat obvious when you go in and enable the clock for a peripheral: APB1 clock control, AHB2 clock control, etc. Peripherals, flash, and so on hang off these busses. But we can run code from sram, so it's not Harvard. Forget the Harvard and von Neumann terms and just focus on the actual implementation.
The bus documentation is as readily available as the core documentation. If you buy the right fpga board you can request a free eval of a core, which gives you an up close and personal view of how it really works.
At the end of the day there is some parallelism, but on many chips the flash is half speed, so if you are not fetching two instructions per read or do not have some other solution, you are barely keeping up, and you stall often if you have other accesses on the same bus. Likewise, on many of these chips the peripherals can't run as fast as the core, so that alone incurs a stall. And even if the peripheral runs on the same clock, that doesn't mean it turns around a csr or data access as fast as sram does, so you incur a stall there too.
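To see how quickly those stalls add up, here is a minimal sketch of the averaging. The function and its inputs are hypothetical, and it assumes every slow access costs the same fixed number of extra core clocks, which real busses will not guarantee.

```python
from fractions import Fraction

def effective_clocks_per_instruction(base_cpi, slow_access_fraction, stall_clocks):
    """Average clocks per instruction once slow bus accesses are included.

    base_cpi:             clocks per instruction when nothing stalls
    slow_access_fraction: share of instructions that touch a slow target
                          (peripheral register, slower bus, etc.)
    stall_clocks:         extra core clocks per such access (assumed constant)
    """
    return Fraction(base_cpi) + Fraction(slow_access_fraction) * stall_clocks

# Placeholder numbers for illustration, not measurements from any chip:
# one in ten instructions reads a peripheral register and waits 2 extra clocks.
cpi = effective_clocks_per_instruction(1, Fraction(1, 10), 2)
print(cpi)      # 6/5 clocks per instruction
print(1 / cpi)  # 5/6 instructions per clock
```

Even with an otherwise perfect fetch path, a modest share of slow accesses is enough to keep you under one instruction per clock.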
There is no reason to assume you will get one-instruction-per-clock performance out of these parts, any more than you would out of a full-sized ARM, an x86, or anything else.
While there are some important details that are not documented and are only seen when you get the core, there is documentation on each core and bus that gives you a rough idea of how to tune your code to perform better, or how to tune your expectations of how it will really perform. I know I have demonstrated this here and elsewhere: it is pretty easy, even with an ST, to see a performance difference between flash and sram and to see that it takes more clocks than instructions to perform a benchmark.
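If you time such a benchmark with the SysTick timer, turning the two counter reads into clocks per instruction is simple arithmetic. A small sketch of that post-processing follows; the helper names are made up, and it assumes SysTick is clocked from the core clock, the reload value is 0x00FFFFFF, and the measured run is shorter than one full wrap of the counter.

```python
from fractions import Fraction

SYSTICK_MASK = 0x00FFFFFF  # SysTick is a 24-bit down counter

def elapsed_clocks(start, end):
    """Clocks between two SysTick reads, tolerating one wrap through zero."""
    # The counter counts down, so start - end is the elapsed count; masking
    # to 24 bits fixes up the case where the counter wrapped in between.
    return (start - end) & SYSTICK_MASK

def clocks_per_instruction(start, end, instructions):
    """Average clocks per executed instruction for the timed region."""
    return Fraction(elapsed_clocks(start, end), instructions)

# Placeholder values for illustration only, not results from real hardware.
print(elapsed_clocks(0x000010, 0xFFFFF0))      # 32 (counter wrapped once)
print(clocks_per_instruction(1000, 400, 300))  # 2
```

Run the same loop from flash and then from sram, count the instructions it executes by hand, and compare the two ratios.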
Your question is too broad in a few ways. The cortex-m0 and m3 are quite different: the m3 was the first one out and is dripping with features, while the m0 was tuned for size and simply has less stuff in general; it was not necessarily meant to compete in this way. Then there is how long the latency is, etc. That is strictly up to the chip company, and to the family within the chip company, so that question is extremely broad: across all the cortex-m products out there, there are dozens of different answers to it. ARM makes cores, not chips. The chip vendors make the chips; they buy IP from various places and make some of their own, and some small part of the chip might be IP they bought from a processor vendor.