2
votes

I'm doing some evaluations on STM32H7, on the STM32H753I-EVAL2 board. I used STMicro example code to configure, write and read the QSPI Flash in memory mapped mode.

I was surprised by some figures regarding the duration of an LDR instruction:

  • I measure the number of cycles an instruction takes using the SysTick (clocked from the CPU clock). As far as I understand, one SysTick cycle = one CPU cycle.

  • I measured two exactly identical instructions, ldrb.w Rn, [Rp, Rq], except that Rp holds an address in DTCM-RAM in one case and an address in QSPI Flash in the other.
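For reference, here is a minimal sketch of how the two SysTick readings could be turned into a cycle count. It assumes the SysTick reload value is the full 24-bit range (0xFFFFFF) and that at most one wrap happens between the two reads; the function names are hypothetical, not from the STMicro example code. The point is that SysTick counts down and is only 24 bits wide, and that the cost of an empty measurement (two back-to-back reads) has to be subtracted, otherwise it is included in the figure.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick's current-value register is a 24-bit down-counter. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Cycles elapsed between two counter reads (hypothetical helper).
 * Assumption: reload value is 0xFFFFFF and the counter wrapped at most once.
 * The counter counts DOWN, so the difference is start - end, modulo 2^24. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    assert(start <= SYSTICK_MASK && end <= SYSTICK_MASK);
    return (start - end) & SYSTICK_MASK;
}

/* Remove the cost of the measurement itself (hypothetical helper).
 * 'overhead' is what systick_elapsed() gives for two back-to-back reads
 * with nothing in between; the result is clamped at zero. */
uint32_t net_cycles(uint32_t measured, uint32_t overhead)
{
    return (measured > overhead) ? (measured - overhead) : 0u;
}
```

On the target, `start` and `end` would be the values read from the SysTick current-value register just before and just after the instruction under test.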

The results are (code executed from internal flash): 15 cycles from DTCM-RAM, 12 cycles from QSPI.

I'm surprised by the results. I guess the QSPI content is cached, which might explain the figures?

Also, 15 cycles for a single LDR instruction seems like quite a lot. What do you think? Is there something wrong with my procedure?

1
Did the code itself run from internal flash? Was it aligned to a cache line boundary? Was RAM used for something else meanwhile, e.g. displaying graphics? - followed Monica to Codidact
Did you measure an unrolled block of multiple load instructions to hide measurement overhead? I'm not sure if Cortex-M can start executing the next instruction while a previous load is still in flight. So measuring a single instruction in isolation might not be representative, depending on exactly how simple Cortex-M is. - Peter Cordes
@berendi Yes, the code executed from internal flash. I didn't modify the mapping and didn't check for cache line alignment. RAM wasn't used for anything else. - Guillaume Petitjean
Your CPU does have 16kiB each of L1i and L1d cache. (So a loop could run from cache, unless cache is disabled for that mem region or entirely.) STM32H753I-EVAL2 uses an STM32H753XI (st.com/en/microcontrollers-microprocessors/stm32h753xi.html), which has a Cortex-M7 core, which is dual-issue superscalar. So yeah, testing a single instruction is probably not great. Depends what you want to measure, but normally execution of an instruction will be able to overlap with one before or after if you schedule instructions well. - Peter Cordes
A good way to test cached load throughput would be to put a big block of ldrb or ldr instructions in a loop, so you have like 1 instruction of loop overhead per 256 loads or something. You want the loop to fit in instruction-cache, unless you want to test competition for I-fetch too. (Use different registers to avoid WAW hazards (or not, to see if loading the same reg repeatedly causes a bottleneck), and use offsets in the addressing modes if you want to load from different cache lines, e.g. to make them all alias the same set and get cache misses.) - Peter Cordes
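The averaging behind the unrolled-block suggestion above could be sketched like this. It is only the arithmetic, not the load block itself; the function name and the use of milli-cycles (to keep fractional results such as 0.5 cycles per load on a dual-issue core without floating point) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of one load, in thousandths of a cycle (hypothetical helper).
 * total_cycles    : cycles measured around the whole unrolled block
 * overhead_cycles : cycles measured for the same harness with no loads
 * n_loads         : number of load instructions in the block (e.g. 256) */
uint32_t millicycles_per_load(uint32_t total_cycles,
                              uint32_t overhead_cycles,
                              uint32_t n_loads)
{
    assert(n_loads > 0u);
    /* Clamp at zero in case the overhead estimate exceeds the measurement. */
    uint32_t net = (total_cycles > overhead_cycles)
                       ? (total_cycles - overhead_cycles)
                       : 0u;
    /* Widen to 64 bits so the multiplication cannot overflow. */
    return (uint32_t)(((uint64_t)net * 1000u) / n_loads);
}
```

With a block of 256 loads, a fixed overhead of 10 to 15 cycles shifts the per-load figure by well under a tenth of a cycle, whereas with a single load it dominates the result.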

1 Answer

0
votes

If the internal flash is not cached, or the cache is invalid, or the pipeline was flushed, or ... (many, many other reasons), it may take more time than the instruction located in QSPI Flash.

To measure execution time, you have special registers, such as the DWT cycle counter (DWT->CYCCNT) on Cortex-M7.