Cortex-M4F lazy FPU stacking

Question

I'm writing threading code for a Cortex M4F. Everything's working and I'm now looking into making FPU context switching more efficient via lazy stacking.

I've read ARM's AN298 and implemented the alternative approach, which disables the FPU and handles the resulting UsageFault, but the hardware is not saving/restoring the lower registers (S0-S15) correctly. I think the problem lies in figure 11:

According to this, when PendSV runs FPCAR should point to the space reserved in Task A's stack. But as I see it, since CONTROL.FPCA is high in Task C, FPCAR will be updated to point to Task C's stack when entering PendSV. If so, S0-S15 and FPSCR will be saved to Task C's stack instead of Task A's, which is of course not correct.

Am I missing something here, or is the appnote wrong?

On a side note, I checked some open source RTOSes. FreeRTOS and mbed RTOS always stack S16-S31 during the context switch, which triggers automatic S0-S15 stacking, i.e. they use lazy stacking only to reduce interrupt latency but do full state preservation for tasks (as in the first approach outlined in the appnote). The TNKernel port for M4F uses the UsageFault approach, but fully saves/restores S0-S31 in software, effectively bypassing any problem with FPCAR (at the cost of 48 load/stores instead of 32, since the 16 hardware ones get overwritten on restore). Nobody seems to be using the UsageFault approach while only preserving S16-S31.

(By the way, this is also posted at ARM Community, but a lot of questions seem to go unanswered there. If I get an answer there, I'll replicate it here, too)

From the App Note: "The FPCAR register points to a section of stack space within the current stack..." So it should point to the stack of the pre-empted task. — D Krueger
@DKrueger Exactly. But the drawing implies it still points to Task A's frame when Task C is preempted by PendSV. That's why I'm confused, I don't know if I misunderstood it or if something in the appnote is wrong. — Andrea Biondo
I think this indicates that FPCAR won't be updated when the FPU is disabled. That's why it continues to point to Task A's stack until Task C is pre-empted with the FPU enabled. Saving the FPU context can then be put off until another task actually requires the FPU. — D Krueger
@DKrueger You have a very good hypothesis. The appnote says FPCAR update depends on FPCA, no mention of FPU state (logic would be a little complex too, considering that it can also be enabled only for privileged code, etc). But I'll put it to the test in the morning. — Andrea Biondo
@DKrueger FPCAR is updated even when FPU is disabled, I tested it. I was on vacation and didn't reply, but I've now found out how to do it properly. — Andrea Biondo

Andrea Biondo · Accepted Answer · 2016-08-08T11:48:08

It took a while, but in the end I found out how to do this as efficiently as possible.

First off, the appnote is wrong. My initial explanation of how FPCAR is updated was right. Note that FPCAR is updated even when the FPU is disabled. Testing also confirmed that FPCAR always points to the interrupted stack.
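For reference, "points to the interrupted stack" means FPCAR holds the address of the S0 slot inside the extended exception frame reserved on that stack. A minimal sketch of the layout in C, assuming the standard ARMv7-M extended frame; the type and helper names are hypothetical and only serve as illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extended exception frame reserved on entry when CONTROL.FPCA is set. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr; /* basic frame, always written */
    uint32_t s[16];                             /* S0-S15: space reserved, written lazily */
    uint32_t fpscr;                             /* also written lazily */
    uint32_t reserved;                          /* keeps the frame a multiple of 8 bytes */
} ext_frame_t;

static_assert(offsetof(ext_frame_t, s) == 0x20, "S0 slot follows the 8 core words");
static_assert(sizeof(ext_frame_t) == 0x68, "26 words in total");

/* Hypothetical helper: the address FPCAR is expected to hold for a frame
 * whose lowest address (the stacked R0) is frame_base. */
static inline uint32_t fpcar_for_frame(uint32_t frame_base)
{
    return frame_base + (uint32_t)offsetof(ext_frame_t, s);
}
```

With lazy stacking active, only the first eight words are written on entry; the S0-S15 and FPSCR slots are filled in later, at whatever address FPCAR holds at that moment.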

My first approach was to manipulate FPCAR, LSPACT and EXC_RETURN, with the UsageFault handler pending PendSV. Of course, for this to work it's essential that writing FPCAR doesn't count as an FPU operation from a lazy stacking perspective. When the documentation is lacking, we can only hack the answers out of the CPU...

LDR  R2, =0xE000EF38
LDR  R3, =0xDEADBEEF
STR  R3, [R2]
VSTM R1, {S16-S31}
UDF

FPCAR is at 0xE000EF38. VSTM is part of the context-saving routine. The idea is that, if writing FPCAR counts as an FPU op, the store to FPCAR itself triggers lazy stacking, which succeeds because FPCAR still holds its old, valid value; execution then reaches UDF and faults there. Otherwise, lazy stacking happens on VSTM with a corrupted FPCAR, resulting in a bus fault.

Indeed, I got a bus fault. Yay! I repeated the test with a valid address: no fault, works perfectly. So saving is simple enough. Restoring requires pending PendSV and manipulating FPCAR, LSPACT and EXC_RETURN inside it to cause S0-S15 for the current thread to be restored on exception return. The problem here is that you can't keep state for the current thread on its stack, as it's going to be popped off. Copying is inefficient, so the best bet is to point FPCAR to the persistent TCB state instead of saving the CPU-generated one.
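The EXC_RETURN part of that restore path comes down to bit 4, which selects between the basic and the extended (FP) frame on exception return. A minimal sketch in C of the bit manipulation only; the helper names are hypothetical, and the FPCAR/LSPACT handling described above is not shown:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EXC_RETURN bit 4: 1 = basic frame, 0 = extended frame with FP state. */
#define EXC_RETURN_FTYPE (1u << 4)

/* True if returning with this EXC_RETURN unstacks an extended frame. */
static inline bool exc_return_has_fp_frame(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_FTYPE) == 0;
}

/* Request an extended frame on return (hypothetical helper). */
static inline uint32_t exc_return_with_fp_frame(uint32_t exc_return)
{
    return exc_return & ~EXC_RETURN_FTYPE;
}

/* Thread mode on PSP: 0xFFFFFFFD (basic) becomes 0xFFFFFFED (extended). */
static_assert((0xFFFFFFFDu & ~EXC_RETURN_FTYPE) == 0xFFFFFFEDu,
              "extended-frame variant of the PSP thread return code");
```

On the target this would be applied to the LR value the handler returns with, typically from assembly.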

This gets quite complex: it requires a PendSV after the UsageFault, and it has quite a few corner cases and races. There's a better way.

The approach I ended up using runs completely inside UsageFault and bypasses hardware stacking, without losing efficiency over it. After enabling the FPU and determining an FPU context switch is required, I:

Set LSPACT to zero;
Save/restore the full S0-S31 state to/from the TCB;
Set LSPACT back to one.

By doing this, I can work on the whole S0-S31 state without lazy stacking getting in the way, because with LSPACT at zero the CPU thinks it has already stacked the context. This of course relies on the UsageFault handler not using FPU ops outside of save/restore and not being preempted by FPU-using ISRs, which are pretty trivial assumptions given that it's hand-coded ASM and fault handlers can't be preempted by ISRs. I also tried disabling lazy stacking via ASPEN/LSPEN instead of working on LSPACT, but it doesn't seem to work (it still triggers lazy stacking, verified by setting an invalid FPCAR).
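The three steps can be sketched as a host-side model in C. This is not the handler itself: the real code is hand-written assembly using VSTM/VLDM, and here the register bank, the hooks, and all names (`sim_bank`, `sim_save`, `sim_restore`, `fpu_switch`) are assumptions made for illustration. Only the FPCCR address and the LSPACT bit position are architectural.

```c
#include <stdint.h>
#include <string.h>

#define FPCCR_ADDR   0xE000EF34u /* FPCCR address on ARMv7-M (unused on the host) */
#define FPCCR_LSPACT (1u << 0)   /* lazy state preservation active */

/* Host stand-in for S0-S31; on the target this is the real register bank. */
static uint32_t sim_bank[32];

/* Hypothetical hooks: on the target these would be VSTM / VLDM over S0-S31. */
static void sim_save(uint32_t *area)    { memcpy(area, sim_bank, sizeof sim_bank); }
static void sim_restore(uint32_t *area) { memcpy(sim_bank, area, sizeof sim_bank); }

typedef void (*fpu_xfer_fn)(uint32_t *tcb_area);

/* Model of the sequence run inside UsageFault once a switch is needed. */
static void fpu_switch(volatile uint32_t *fpccr,
                       uint32_t *prev_area, uint32_t *next_area,
                       fpu_xfer_fn save, fpu_xfer_fn restore)
{
    *fpccr &= ~FPCCR_LSPACT; /* 1. clear LSPACT so FPU ops don't trigger lazy stacking */
    save(prev_area);         /* 2. S0-S31 -> previous FPU owner's TCB area */
    restore(next_area);      /*    S0-S31 <- faulting thread's TCB area */
    *fpccr |= FPCCR_LSPACT;  /* 3. set LSPACT back to one */
}
```

On the target, `fpccr` would be `(volatile uint32_t *)FPCCR_ADDR`; passing a pointer keeps the model runnable on a host, where that address is not mapped.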

Efficiency-wise, this is as efficient as hardware stacking. If I wanted to nitpick, it saves one cycle as I don't need to write back the incremented pointer.

By the way, I included the first approach even though I didn't end up using it because I think it has some useful info in there, if anyone else comes looking for this.
