9
votes

I was just going over this answer by Peter Cordes and he says,

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?

2
Peter Cordes and others probably have a more comprehensive explanation but, the way I understand it, flag bits are renamed separately in register renaming. For the instructions that set all flag bits, which is the majority, the state of all those "registers" can be reset all at once, but for instructions that only affect a subset of the flag bits, the actual flag values need to be merged from the current instruction as well as the last one that set the remaining flag bits, if that makes sense. This merging (sometimes) takes extra time. – 500 - Internal Server Error
My mental model was just that the instruction operated on a global flag register in serial? Is that not true? Look forward to Peter's answer if he buzzes in. – Evan Carroll
@EvanCarroll: EFLAGS is renamed of course. How could add have 4 per clock throughput if you didn't break the WAW hazard? (And yes, different groups of flags are renamed separately, so inc can also have 4 per clock throughput and no input dependency on FLAGS, like how some Intel CPUs can rename ah separately from al when they're written separately.) Working on an answer, but see Agner Fog's microarch guide: agner.org/optimize. He explains partial-flag stalls and merges. – Peter Cordes
I'm going to shut up and await the answer. I won't lie about having Amazon'd your name a few times. Just take my money in the event you ever put out a book on x86, Linux, or Radare. – Evan Carroll

2 Answers

10
votes

Generally speaking, a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.

So an instruction like inc that sets only some flags (it doesn't set CF) doesn't inherently cause a partial stall, but will cause a stall if a subsequent instruction reads the flag (CF) that was not set by inc (without any intervening instruction that sets the CF flag). This also implies that instructions that write all interesting flags are never involved in partial stalls since when they are the most recent flag setting instruction at the point a flag reading instruction is executed, they must have written the consumed flag.

So, in general, an algorithm for statically determining whether a partial flags stall will occur is to look at each instruction that uses the flags (generally the jcc family and cmovcc and a few specialized instructions like adc) and then walk backwards to find the first instruction that sets any flag and check if it sets all of the flags read by the consuming instruction. If not, a partial flags stall will occur.
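The backward walk described above can be sketched in a few lines of Python. This is only an illustration: the `WRITES` and `READS` tables and the `partial_flag_stalls` helper are hypothetical names, they cover only a handful of mnemonics, and the model ignores operands entirely.

```python
# Hypothetical flag model: which status flags each mnemonic writes or reads.
# The tables are deliberately tiny; they are an assumption for illustration,
# not a complete description of x86.
ALL = frozenset("CPAZSO")  # CF, PF, AF, ZF, SF, OF
WRITES = {
    "add": ALL, "sub": ALL, "cmp": ALL, "adc": ALL,
    "inc": ALL - {"C"}, "dec": ALL - {"C"},  # inc/dec leave CF untouched
}
READS = {
    "jc": {"C"}, "jnc": {"C"},
    "jz": {"Z"}, "jnz": {"Z"},
    "ja": {"C", "Z"}, "jbe": {"C", "Z"},
    "adc": {"C"},
}

def partial_flag_stalls(mnemonics):
    """Return the indices of flag readers that stall under the rule above."""
    stalls = []
    for i, op in enumerate(mnemonics):
        needed = READS.get(op)
        if not needed:
            continue  # this instruction does not consume flags
        # Walk backwards to the most recent instruction that writes any flag.
        for prev in reversed(mnemonics[:i]):
            written = WRITES.get(prev)
            if written:
                if not needed <= written:
                    stalls.append(i)  # some needed flag came from elsewhere
                break
    return stalls
```

For instance, `partial_flag_stalls(["add", "inc", "jc"])` flags the `jc`, while the same sequence ending in `jnz` is reported as clean.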

Later architectures, starting with Sandy Bridge, don't suffer a partial flags stall per se, but still suffer a penalty in the form of an additional uop added to the front-end by the instruction in some cases. The rules are slightly different and apply to a narrower set of cases compared to the stall discussed above. In particular, the so-called flag merging uop is added only when a flag-consuming instruction reads from multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.
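The merge-uop rule differs from the stall rule in that it looks at who last wrote each flag the consumer reads. A rough Python sketch, again using a hypothetical and very incomplete flag table (`WRITES`, `READS` and `needs_merge_uop` are illustrative names, not anything from a real tool):

```python
ALL = frozenset("CPAZSO")  # CF, PF, AF, ZF, SF, OF
# Assumed, incomplete tables of flags written / read per mnemonic.
WRITES = {"add": ALL, "cmp": ALL, "inc": ALL - {"C"}, "dec": ALL - {"C"}}
READS = {"jc": {"C"}, "jnz": {"Z"}, "ja": {"C", "Z"}, "jbe": {"C", "Z"}}

def needs_merge_uop(mnemonics, reader_index):
    """True if the reader's flags were last written by different instructions."""
    last_writer = {}  # flag -> index of the instruction that last wrote it
    for i, op in enumerate(mnemonics[:reader_index]):
        for flag in WRITES.get(op, ()):
            last_writer[flag] = i
    needed = READS.get(mnemonics[reader_index], set())
    # A flag never written in the sequence maps to None, which this sketch
    # treats as one more distinct producer (the state before the sequence).
    producers = {last_writer.get(flag) for flag in needed}
    return len(producers) > 1
```

A reader of a single flag always yields exactly one producer, which is why such instructions never need a merge in this model.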

Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the SPAZO flags, which are renamed together as a group, can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition (be, "below or equal") requires both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of the C and SPAZO flags, however, and take one uop.
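To see which conditions fall into the "needs both groups" category, the condition codes can be tabulated against the two rename groups. The flags tested by each condition are architectural; the split into a C group and a SPAZO group is taken from the paragraph above, and `flag_groups` is just an illustrative helper.

```python
# Flags tested by each condition code (synonyms such as z/nz or c/nc omitted).
CONDITION_FLAGS = {
    "b": {"C"}, "ae": {"C"},
    "e": {"Z"}, "ne": {"Z"},
    "s": {"S"}, "ns": {"S"},
    "o": {"O"}, "no": {"O"},
    "p": {"P"}, "np": {"P"},
    "l": {"S", "O"}, "ge": {"S", "O"},
    "le": {"Z", "S", "O"}, "g": {"Z", "S", "O"},
    "be": {"C", "Z"}, "a": {"C", "Z"},
}
SPAZO = frozenset("SPAZO")

def flag_groups(cc):
    """Return which rename groups ("C", "SPAZO") a condition code reads."""
    flags = CONDITION_FLAGS[cc]
    groups = set()
    if "C" in flags:
        groups.add("C")
    if flags & SPAZO:
        groups.add("SPAZO")
    return groups

# Conditions that need both groups as inputs.
both = sorted(cc for cc in CONDITION_FLAGS if len(flag_groups(cc)) == 2)
```

In this table only `a` and `be` read both groups, which matches the observation that most conditional moves need just one flag input.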

Examples

Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but as above only at most one of the two applies to any given architecture, so something like "The following causes a stall and a merge uop to be emitted" should be read as "The following causes a stall [on those older architectures which have partial flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".

Stall and merging uop

The following example will cause a stall on the older architectures that have partial flag stalls, and a merging uop to be emitted on Sandy Bridge and Ivy Bridge, but neither on Skylake:

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
ja  label    ; reads CF and ZF

The ja instruction reads CF and ZF which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads from CF which was not set by the most recent flag setting instruction.

Stall only

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jc  label    ; reads CF

This causes a stall because, as in the prior example, CF is read but is not set by the last flag-setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add, since they are independent; the jc would then read only from the most recent flag-setting operation. There is no merge uop needed because the flags read (only CF) all come from the same add instruction.

Note: This case is under debate (see the comments) - but I cannot test it because I don't find evidence of any merging uops at all on my Skylake.

No stall or merging uop

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jnz  label   ; reads ZF

Here there is no stall or merging uop needed, even though the last instruction (inc) only sets some flags, because the consuming jnz only reads (a subset of) flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.

Here's another example that doesn't cause any stall or merge uop:

inc rax      ; sets ZF, but not CF
add rbx, 5   ; sets CF, ZF, others
ja  label    ; reads CF and ZF

Here the ja does read both CF and ZF and an inc is present which doesn't set CF (i.e., a partial flag writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.

Shifts

The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling [1]. For example, on many architectures there is something like a partial flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures variable shifts have a significant cost of 3 uops due to flag handling (but there is no more "stall").

I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.

Some rotate instructions also have interesting flag related behavior in some cases similar to shifts.
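As a rough illustration of why shifts are awkward for flag renaming, here is a small Python model of which status flags shl/shr/sar leave with a defined value for a given count, as I read Intel's instruction reference. The function name is made up, and the model ignores the corner case where CF is undefined because the count is at least as large as an 8- or 16-bit operand.

```python
def shift_flags_defined(count, operand_bits=32):
    """Flags with a defined value after shl/shr/sar with this count.

    Simplified model: AF is treated as undefined for any non-zero count,
    and the oversized-count case for 8/16-bit operands is not modelled.
    """
    # The hardware masks the count to 5 bits (6 bits for 64-bit operands).
    mask = 63 if operand_bits == 64 else 31
    count &= mask
    if count == 0:
        return frozenset()         # no flags are touched at all
    if count == 1:
        return frozenset("COSZP")  # OF is only defined for 1-bit shifts
    return frozenset("CSZP")       # OF undefined for larger counts
```

With a variable count the hardware cannot know at rename time which of these three cases applies, which is one plausible reading of why the extra flag-handling uops are needed.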


[1] For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.

1
votes

A flag-modifying uop may update only part of the flags register. The RAT has one entry for the flags/eflags/rflags register, plus a mask showing which flags were changed by the uop whose result the entry's physical register holds. If a series of instructions reads and writes the same flag, a separate physical register is assigned for each write and each read uses the previous physical register. Each of those registers holds the written flag, with all other flags clear. That is why the current physical register cannot be used when a read targets a flag that is not in the mask in the flags RAT entry: it would read a clear bit rather than the real state of the flag, which has been left behind in an older register. On old microarchitectures, a stall occurs until the state of the flags register is valid in the RRF. This means waiting for every earlier flag-setting uop to retire and insert the bits it set into the RRF flags register; at retirement each uop is examined to determine the architectural registers it uses and the flags it changes, which is easier to do from uops than from x86 macro-ops.
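The mask check described above can be modelled very loosely as follows. This is a toy under my own assumptions, not a description of real hardware structures: `FlagsRatEntry`, `rename_flag_write` and `can_read_from_rat` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class FlagsRatEntry:
    phys_reg: int    # physical register holding the latest flag write
    mask: frozenset  # flags that were actually written into that register

def rename_flag_write(next_phys_reg, flags_written):
    """Point the flags RAT entry at a fresh register and record its mask."""
    return FlagsRatEntry(next_phys_reg, frozenset(flags_written))

def can_read_from_rat(entry, flags_needed):
    """A read is safe only if every needed flag is covered by the mask."""
    return set(flags_needed) <= entry.mask

# inc writes everything except CF, so a later CF read cannot use its register.
after_inc = rename_flag_write(7, "PAZSO")
```

In this model a `jnz` after the `inc` passes the check, while a `jc` fails it and would have to wait (older designs) or be preceded by a merge (newer ones).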

On microarchitectures that use the PRF scheme (SnB onwards), a merging uop is required to keep a unified flags register when there is no dedicated RRF register, otherwise the retirement RAT would be pointing to a meaningless physical register with only 1 of the flags in. The merging uop occurs after every partial-flags modifying instruction like inc or dec. add modifies all 6 status flags and therefore does not require a merge uop. I think this probably implies that status, control and system flags are renamed separately on the PRF scheme, given that add does not require a merging uop. Apparently the CF flag is renamed differently to the SPAZO cluster.

Partial register stalls are similar. The RAT has 2 entries to represent rax: one for al/ax/eax/rax (distinguished by a size indicator in the entry) and one for ah (a write to ax, eax or rax updates both entries to point to the same register). Two entries are enough because al and ah are the only 2 mutually exclusive parts of the register. If a read from eax occurs before a previous write to one of the smaller registers retires, the allocator stalls (because the ROB entry cannot have 2 dependencies for the same operand) until the full register is present in the RRF, and then it renames both entries to the RRF register for rax.

In later microarchitectures that use the PRF scheme, this is now difficult because a single RRF for rax is no longer kept. Therefore, a merging uop needs to be used, which also happens to be faster than the stall method of the previous microarchitectures.

Merging uop implementations

  1. One implementation of the merging uop could be that it is inserted before every write to a partial flag / register, and the merging uop reads from the full register / flags register before writing it all to a new physical register. The write is then allocated the same register, which results in the write naturally merging itself in. The following read can then read any part of the register / any flag. This basically sets up a dependency chain between every partial-flag writing instruction and a previous flag writing instruction (partial or full) and between every partial register write and a previous (full / partial) write to the register. In this instance, the RAT never has partial renames.

  2. It could be allocated immediately after the write to a partial register. The merge uop takes the previous physical register (which will always be a full rax/eax write, or in the case of flags, a full status flag update, like that which is done by add or the merge uop) and the new physical register and combines them into the new physical register. This would suggest that the allocator inserts it. If it were inserted by the decoder, the allocator could allocate that uop in a different cycle, when the previous RAT pointer is unknown.

  3. It could be allocated immediately before a read from a register that has a non-unified (split) state in the RAT. This would imply that the RAT tracks rax/eax separately from ax, al and ah. In this case, the 2 physical registers that need to be merged are taken from the RAT.

The optimisation manual implies it is one of the latter 2 scenarios: 'The merging uop occurs after every partial register write' (i.e. a write to ax, al or ah, but not eax).
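Whichever of these scenarios is the real one, the data operation performed by a flag merge is the same: take the untouched bits from the previous full value and the written bits from the new partial value. A minimal sketch using the architectural EFLAGS bit positions (the `merge_flags` helper and `INC_MASK` are illustrative names):

```python
# Architectural bit positions of the status flags in EFLAGS.
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
INC_MASK = PF | AF | ZF | SF | OF  # inc/dec write everything except CF

def merge_flags(previous_full, new_partial, written_mask):
    """Combine an older full flags value with a newer partial write."""
    return (previous_full & ~written_mask) | (new_partial & written_mask)

# add left CF and ZF set; a following inc produced a non-zero result (no
# flags set in its partial register). The merged view keeps CF from add.
merged = merge_flags(CF | ZF, 0, INC_MASK)
```

The same bit-select pattern applies to partial registers, with the mask covering al, ah or ax instead of a group of flags.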