Are load ops deallocated from the RS when they dispatch, complete or some other time?

Question

On modern Intel¹ x86, are load uops freed from the RS (Reservation Station) at the point they dispatch², or when they complete³, or somewhere in-between⁴?

¹ I am also interested in AMD Zen and sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel which may make investigating this on AMD a separate task.

² Dispatch here means leave the RS for execution.

³ Complete here means when the load data returns and is ready to satisfy dependent uops.

⁴ Or even somewhere outside of the range of time defined by these two events, which seems unlikely but possible.

@PeterCordes and BeeOnRope, a few questions about the chat: 1) Re: L1/L2 cache line splits taking 2x + 1 cycles: could it be a memory-ordering thing, i.e. the CPU needs to make sure the two loads are consistent? 2) Re: "So apparently the core spams the uops in case the load arrived in time for that cycle?": was this ever confirmed? BeeOnRope somewhat refuted it because it doesn't scale with L3 / RAM access, but I just want to confirm. 3) Re: "instructions dependent on the load, that will dispatch 0 or 1 cycles after the load, are subject to replay": would this scale for, say... — Noah
movl (rax), edx; leal (rdx), ecx; leal (rdx), edi; leal (rdx), esi... On the same ICL with 4 ports for lea, would all 3 of the lea above be replayable? What if it's more uops than the RAT bandwidth? 4) If the uops are not replayed in a loop, is there an idea of when they will get redispatched? Is it only if there is no contention for the port (hopefully), or can it actually add extra bottlenecks? 5) Will the replay always be on the same port the instruction was dispatched to? — Noah
Is the RAT even involved in replays? I don't think the uop has to be renamed again, so I assumed it would be something downstream of that. I did a fair amount of investigation into replays but couldn't come up with a hard and fast rule. Almost always, uops that could dispatch as soon as the load came back (e.g., all the lea in your example) would replay, but uops that would dispatch a cycle later due to port conflicts and dependencies would also often replay, and sometimes more than that. I couldn't come up with an exact bright-line "horizon" in cycles from the load result where stuff — BeeOnRope
would replay: if I picked a specific number, I found counter-examples on both sides. I can't remember whether the same test, when repeated, also showed variability or a non-integer number of replays (averaged over many iterations). It is possible there is something involved in replay that operates at half frequency, or a structure where only a part of the structure is scanned each cycle, leading to variable replay behavior. — BeeOnRope

Andreas Abel · Accepted Answer · 2020-01-27T23:38:14

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights.

On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, which is used for the following experiments.

We assume that R14 contains a valid memory address.

clflush [R14]
clflush [R14+512]
mfence

# start measuring cycles

mov RAX, [R14]
mov RAX, [R14]
...
mov RAX, [R14]

mov RBX, [R14+512]

# stop measuring cycles

The instruction mov RAX, [R14] is unrolled 35 times. A load from memory takes at least about 280 cycles on this system. If the load uops stayed in the 33-entry reservation station until completion, the last load could only start after more than 280 cycles and would need another ~280 cycles, for a total of roughly 560 cycles or more. However, the total measured time for this experiment is only about 340 cycles. This indicates that the load uops leave the RS at some time before completion.
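The arithmetic behind this argument can be written out as a rough sketch. The numbers come from the text above; the function names are made up for illustration, and the model deliberately ignores everything except RS capacity and memory latency (front-end throughput, ROB and load-buffer limits are assumed not to matter here).

```python
# Values taken from the text above; all of them are approximate.
MEM_LATENCY_CYCLES = 280   # cycles for a load that misses all caches
RS_LOAD_ENTRIES = 33       # RS entries available to loads
LOADS_BEFORE_LAST = 35     # unrolled "mov RAX, [R14]" instructions
MEASURED_CYCLES = 340      # measured total for this experiment


def bound_if_held_until_completion(loads_before_last, rs_entries, latency):
    """Lower bound if a load keeps its RS entry until its data returns."""
    if loads_before_last >= rs_entries:
        # The RS is full of pending loads, so the last load cannot even
        # enter it before the first miss returns; it then misses as well.
        return 2 * latency
    # Otherwise the last load fits in the RS and both misses overlap.
    return latency


def bound_if_freed_before_completion(latency):
    """Lower bound if loads leave the RS early, so both misses overlap."""
    return latency


held = bound_if_held_until_completion(
    LOADS_BEFORE_LAST, RS_LOAD_ENTRIES, MEM_LATENCY_CYCLES
)
freed = bound_if_freed_before_completion(MEM_LATENCY_CYCLES)

print(f"held until completion:   >= {held} cycles")
print(f"freed before completion: >= {freed} cycles")
print(f"measured:                ~{MEASURED_CYCLES} cycles")
```

The measured value falls between the two bounds, well below what the "held until completion" assumption would require.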

In contrast, the following experiment shows a case where most uops are forced to stay in the reservation station until the first load completes:

mov RAX, R14
mov [RAX], RAX
clflush [R14]
clflush [R14+512]
mfence

# start measuring cycles

mov RAX, [RAX]
mov RAX, [RAX]
...
mov RAX, [RAX]

mov RBX, [R14+512]

# stop measuring cycles

The first 35 loads now have dependencies on each other. The measured time for this experiment is about 600 cycles.

The experiments were performed with all but one core disabled, and with the CPU governor set to performance (cpupower frequency-set --governor performance).

Here are the nanoBench commands I used:

./nanoBench.sh -unroll 1 -basic -asm_init "clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RBX, [R14+512]"

./nanoBench.sh -unroll 1 -basic -asm_init "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RBX, [R14+512]"
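Writing out 35 copies of an instruction by hand is error-prone, so the two command lines above could also be generated. The following is only a sketch: the helper name build_command is hypothetical, it uses only the nanoBench options that appear in the commands above, and it prints the commands rather than running them.

```python
import shlex


def build_command(load, n_loads, init, final_load="mov RBX, [R14+512]"):
    """Build a nanoBench argument list with `load` repeated n_loads times."""
    asm = "; ".join([load] * n_loads + [final_load])
    return [
        "./nanoBench.sh", "-unroll", "1", "-basic",
        "-asm_init", init,
        "-asm", asm,
    ]


# Experiment 1: 35 independent loads from the same address.
independent = build_command(
    "mov RAX, [R14]", 35,
    "clflush [R14]; clflush [R14+512]; mfence",
)

# Experiment 2: 35 loads that form a dependency chain through RAX.
dependent = build_command(
    "mov RAX, [RAX]", 35,
    "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence",
)

# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(independent))
print(shlex.join(dependent))
```

Changing the repeat count (for example to values just below and above 33) would be one way to probe where the behavior changes, though that variation is not part of the measurements reported here.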
