I'm doing some evaluations on the STM32H7, using the STM32H753I-EVAL2 board. I used STMicro example code to configure, write and read the QSPI Flash in memory-mapped mode.
I was surprised by some figures for the duration of an LDR instruction.
I measure the number of cycles an instruction takes using the SysTick (clocked from the CPU clock). As far as I understand, one SysTick cycle = one CPU cycle.
I measured two exactly identical instructions:
ldrb.w Rn, [Rp, Rq]
The only difference is that Rp is an address in DTCM-RAM in one case and an address in QSPI Flash in the other.
The results are (code executed from internal flash): 15 cycles from DTCM-RAM, 12 cycles from QSPI.
I'm surprised by the results. I guess the QSPI content is cached, which might explain the figures?
Also, 15 cycles for a single LDR instruction seems like quite a lot. What do you think? Is there something wrong with my procedure?
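For reference, the arithmetic behind a SysTick-based measurement like this can be sketched as below. It is a minimal sketch, not the code actually used: the helper names `systick_elapsed` and `net_cycles` are hypothetical, and it assumes the reload value is 0xFFFFFF and that the counter wraps at most once between the two reads. The `overhead` value is assumed to come from timing an empty start/stop pair, since the two reads of the SysTick counter are themselves loads and are included in the raw count.

```c
#include <stdint.h>

/* The SysTick current-value register is 24 bits wide. */
#define SYSTICK_MASK 0x00FFFFFFu

/* SysTick counts DOWN, so elapsed ticks = start - end, modulo 2^24.
 * Assumption: reload value is 0xFFFFFF and at most one wrap occurred. */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* Subtract the cost of the measurement itself (hypothetical 'overhead',
 * obtained by timing an empty start/stop pair). Clamps at zero. */
static uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    uint32_t raw = systick_elapsed(start, end);
    return (raw > overhead) ? (raw - overhead) : 0u;
}
```

With this, a raw reading of 15 ticks only corresponds to 15 cycles for the load if the empty measurement reads 0, which is unlikely when the counter reads bracket the instruction.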
ldrb or ldr instructions in a loop, so you have like 1 instruction of loop overhead per 256 loads or something. You want the loop to fit in instruction-cache, unless you want to test competition for I-fetch too. (Use different registers to avoid WAW hazards (or not, to see if loading the same reg repeatedly causes a bottleneck), and use offsets in the addressing modes if you want to load from different cache lines, e.g. to make them all alias the same set and get cache misses.) – Peter Cordes
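The idea of amortizing loop and measurement overhead over many loads could be sketched in C roughly as follows. This is only an illustration under stated assumptions: `load_block`, `cycles_per_load_x100`, and `LOADS_PER_ITER` are hypothetical names, the unroll factor of 16 is arbitrary, and in C the compiler chooses the registers, so hand-written assembly would be needed to control register reuse as the comment suggests. The `volatile` qualifier is what forces every load to actually be issued.

```c
#include <stddef.h>
#include <stdint.h>

#define LOADS_PER_ITER 16u   /* arbitrary unroll factor for this sketch */

/* Issues iterations * LOADS_PER_ITER byte loads from 'base'. Load k of each
 * iteration reads base[k * stride]; choosing 'stride' selects whether the
 * loads hit the same or different cache lines. The sum is returned so the
 * results are used. */
static uint32_t load_block(const volatile uint8_t *base, size_t stride,
                           uint32_t iterations)
{
    uint32_t sum = 0;
    for (uint32_t n = 0; n < iterations; n++) {
        sum += base[0 * stride];   sum += base[1 * stride];
        sum += base[2 * stride];   sum += base[3 * stride];
        sum += base[4 * stride];   sum += base[5 * stride];
        sum += base[6 * stride];   sum += base[7 * stride];
        sum += base[8 * stride];   sum += base[9 * stride];
        sum += base[10 * stride];  sum += base[11 * stride];
        sum += base[12 * stride];  sum += base[13 * stride];
        sum += base[14 * stride];  sum += base[15 * stride];
    }
    return sum;
}

/* Average cost per load in hundredths of a cycle, given the net cycle count
 * measured around load_block() for the same 'iterations' (must be > 0). */
static uint32_t cycles_per_load_x100(uint32_t net_cycles, uint32_t iterations)
{
    uint64_t loads = (uint64_t)iterations * LOADS_PER_ITER;
    return (uint32_t)(((uint64_t)net_cycles * 100u) / loads);
}
```

The same function can be pointed at a buffer in DTCM-RAM and at one in the memory-mapped QSPI region, and the two per-load averages compared. The additions into `sum` are part of what gets measured, so the result is an upper bound on the load cost rather than an exact figure.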