5
votes

On modern Intel[1] x86, are load uops freed from the RS (Reservation Station) at the point they dispatch[2], or when they complete[3], or somewhere in-between[4]?


[1] I am also interested in AMD Zen and sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel which may make investigating this on AMD a separate task.

[2] Dispatch here means leave the RS for execution.

[3] Complete here means when the load data returns and is ready to satisfy dependent uops.

[4] Or even somewhere outside of the range of time defined by these two events, which seems unlikely but possible.

@PeterCordes and BeeOnRope a few questions about the chat: 1) Re: L1/L2 cache line splits taking 2x + 1 cycles. Could it be a memory ordering thing? I.e. the CPU needs to make sure the two loads are consistent? 2) Re: "So apparently the core spams the uops in case the load arrived in time for that cycle?" Was this ever confirmed? BeeOnRope somewhat refuted it because it doesn't scale with L3 / RAM access, but I just want to confirm. 3) Re: "instructions dependent on the load, that will dispatch 0 or 1 cycles after the load, are subject to replay": would this scale for, say... – Noah
movl (rax), edx; leal (rdx), ecx; leal (rdx), edi; leal (rdx), esi... On the same ICL with 4 ports for lea, would all 3 of the lea above be replayable? What if it's more uops than RAT bandwidth? 4) If the uops are not replayed in a loop, is there an idea of when they will get redispatched? Is it only if there is no contention for the port (hopefully), or can it actually add extra bottlenecks? 5) Will replay always be on the same port the instruction was dispatched to? – Noah
Is the RAT even involved in replays? I don't think the uop has to be renamed again, so I assumed it would be something downstream of that. I did a fair amount of investigation into replays but couldn't come up with a hard and fast rule. Almost always uops that could dispatch as soon as the load came back (e.g., all the lea in your example) would replay, but also uops that would dispatch a cycle later due to port conflicts and dependencies would often replay, and sometimes more than that. I couldn't come up with an exact bright line "horizon" in cycles from the load result where stuff – BeeOnRope
would replay: if I picked a specific number I found counter-examples on both sides. I can't remember if the same test repeated also showed variability or a non-integer number of replays (averaged over many iterations), either. It is possible there is something involved in replay that operates at half frequency, or a structure where only a part of the structure is scanned each cycle, leading to variable replay behavior. – BeeOnRope

2 Answers

5
votes

The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights.

On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, which is used for the following experiments.

We assume that R14 contains a valid memory address.

clflush [R14]
clflush [R14+512]
mfence

# start measuring cycles

mov RAX, [R14]
mov RAX, [R14]
...
mov RAX, [R14]

mov RBX, [R14+512]

# stop measuring cycles

mov RAX, [R14] is unrolled 35 times. A load from memory takes at least about 280 cycles on this system. If the load uops stayed in the 33-entry reservation station until completion, the last load could only start after more than 280 cycles and would need another ~280 cycles. However, the total measured time for this experiment is only about 340 cycles. This indicates that the load uops leave the RS some time before completion.
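
The arithmetic behind this argument can be written down as a small back-of-the-envelope model. This is only a sketch, not a simulation of the core: the function name is made up for illustration, and it assumes that all loads to the same line complete at roughly the same time and that front-end and allocation delays can be ignored. The 33 entries, the 35 unrolled loads, and the ~280-cycle miss latency are the figures quoted above.

```python
def bound_if_held_until_completion(rs_entries, n_same_line_loads, miss_latency):
    """Rough lower bound on total cycles IF a load kept its RS entry until its data returned."""
    # The load to the second cache line comes after all the same-line loads.
    final_load_index = n_same_line_loads + 1
    if final_load_index <= rs_entries:
        # Every load fits in the RS at once, so both misses can overlap.
        return miss_latency
    # Otherwise the final load cannot even enter the RS until the first miss
    # has returned, and it then needs a full miss latency of its own.
    return 2 * miss_latency


predicted = bound_if_held_until_completion(rs_entries=33, n_same_line_loads=35, miss_latency=280)
measured = 340  # approximate value reported for this experiment
print(predicted, measured, measured < predicted)
```

Under these assumptions the "held until completion" hypothesis predicts at least about 560 cycles, well above the roughly 340 that were measured.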

In contrast, the following experiment shows a case where most uops are forced to stay in the reservation station until the first load completes:

mov RAX, R14
mov [RAX], RAX
clflush [R14]
clflush [R14+512]
mfence

# start measuring cycles

mov RAX, [RAX]
mov RAX, [RAX]
...
mov RAX, [RAX]

mov RBX, [R14+512]

# stop measuring cycles

The first 35 loads now have dependencies on each other. The measured time for this experiment is about 600 cycles.
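
A rough estimate shows why a figure near 600 cycles is plausible when waiting uops hold RS entries. The sketch below is a simplified model under stated assumptions, not a description of the real scheduler: the function name is hypothetical, the 4-cycle L1 hit latency is an assumed value, a load is assumed to free its RS entry when it dispatches, and front-end, allocation, and replay effects are ignored.

```python
def estimate_dependent_chain(rs_entries, chain_len, miss_latency, hit_latency):
    """Estimate total cycles for a pointer-chasing chain followed by one independent miss."""
    # Chain load 1 misses; every later chain load hits once its address is ready.
    chain_done = miss_latency + (chain_len - 1) * hit_latency
    # Chain loads 2..chain_len sit in the RS waiting for their address operand.
    waiting = chain_len - 1
    if waiting < rs_entries:
        # The independent load finds a free entry immediately.
        final_enters = 0
    else:
        # This many waiting loads must dispatch before an entry frees up for it.
        must_dispatch = waiting - rs_entries + 1
        # The m-th waiting load dispatches at miss_latency + (m - 1) * hit_latency.
        final_enters = miss_latency + (must_dispatch - 1) * hit_latency
    final_done = final_enters + miss_latency
    return max(chain_done, final_done)


print(estimate_dependent_chain(rs_entries=33, chain_len=35, miss_latency=280, hit_latency=4))
```

With these numbers the model gives 564 cycles, in the same ballpark as the measured ~600 and far from the ~340 cycles of the independent-load experiment.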

The experiments were performed with all but one core disabled, and with the CPU governor set to performance (cpupower frequency-set --governor performance).

Here are the nanoBench commands I used:

./nanoBench.sh -unroll 1 -basic -asm_init "clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RBX, [R14+512]"

./nanoBench.sh -unroll 1 -basic -asm_init "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RBX, [R14+512]"

5
votes

Just came across this question. Here is my attempt at an answer.

Short Answer: I'm still a bit uncertain about some parts, but based on measurements using various performance counters along with performance monitoring interrupts, it "looks like" the load uop is removed from the RS during the same cycle it is dispatched to a load port, or at least very shortly afterwards.

Details: A while ago I tried writing a kernel module which mimics the ideas here. The blog post linked describes the idea really well, so I won't explain it in detail here. The main idea is to trigger a performance monitoring interrupt after a set number of cycles has elapsed, freeze all currently tracked counter values, store them, and reset/repeat. Doing this for 1, 2, ... n cycles gives us some picture of what is going on microarchitecturally at cycle granularity. How accurate a picture that is, is a different story... The source for the kernel module I used for measuring can be found here.

Long Answer: I profiled the code below using the kernel module mentioned above on an i7-1065G7 (Ice Lake) and tracked 11 different performance counters. Prior to the profiled mov instruction, clflush was called on the address stored in r8. This was done so that the load would take long enough to make it easy to tell whether the uop was removed from the RS before, after, or during execution (otherwise the load completes in about 4 cycles). In total I measured up to 600 cycles, with most of the events of interest for this question happening within 65 cycles. To account for noise I did 1024 trials for each cycle and stored the counter value that occurred most often. Luckily, for each cycle in the table below and each counter, at most a single trial deviated, with the remaining 1023 trials giving the same counter value.
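
The noise filtering described above (keep the most frequent reading across trials) can be illustrated in a few lines. This is a user-space sketch with made-up readings, not code from the kernel module; the function name is hypothetical.

```python
from collections import Counter


def most_common_reading(readings):
    """Return (value, count) for the counter reading seen most often across trials."""
    value, count = Counter(readings).most_common(1)[0]
    return value, count


# Made-up data: 1024 trials of one counter at one cycle, with a single outlier.
trials = [46] * 1023 + [47]
value, count = most_common_reading(trials)
print(value, count, len(trials) - count)
```

The last printed number is how many trials disagreed with the chosen value, which is the quantity reported as "at most a single trial" above.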

 563:   0f 30                   wrmsr  
 565:   4d 8b 00                mov    (%r8),%r8
 568:   0f ae f0                mfence 
 56b:   0f ae e8                lfence

The counters tracked are listed below. Descriptions are summarized from Intel SDM.

  INST_RETIRED_ANY_P:          To track when wrmsr retired
  RS_EVENTS_EMPTY_CYCLES:      Count of cycles RS is empty
  UOPS_DISPATCHED_PORT_PORT_0: # uops dispatched to port 0
  UOPS_DISPATCHED_PORT_PORT_1: # uops dispatched to port 1 
  UOPS_DISPATCHED_PORT_2_3:    # uops dispatched to port 2,3 (load addr ports)
  UOPS_DISPATCHED_PORT_4_9:    # uops dispatched to port 4,9 (store data ports)
  UOPS_DISPATCHED_PORT_PORT_5: # uops dispatched to port 5
  UOPS_DISPATCHED_PORT_PORT_6: # uops dispatched to port 6
  UOPS_DISPATCHED_PORT_7_8:    # uops dispatched to port 7,8 (store addr ports)
  UOPS_EXECUTED_THREAD:        # uops executed
  UOPS_ISSUED_ANY:             # uops sent to RS from RAT

The table below lists each counter value at each cycle. Based on the table, one uop is sent to the RS at cycle 47 and occupies the RS for cycles 51-54. This is presumably the load uop. At cycle 54 RS_EVENTS_EMPTY_CYCLES and UOPS_DISPATCHED_PORT_2_3 increment, which means (at least as I'm interpreting it) that the load uop has been dispatched and freed from the RS.

What I'm not sure about is that at cycle 52 three more uops are issued to the RS. They seem to arrive and occupy the RS for cycles 55-58, but only two uops are dispatched to the execution ports before the RS is emptied. Regardless, by cycle 59 the RS is empty (the count increases each cycle). The load completes and the mov retires about 500 cycles later.

+-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
| Cycle | Inst Retired | Cycles RS Empty | Port 0 | Port 1 | Port 2,3 | Port 4,9 | Port 5 | Port 6 | Port 7,8 | uops executed | uops issued to RS |        Comments        |
+-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
|     1 |            0 |               3 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
|     2 |            0 |               4 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
|     3 |            0 |               5 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
|     4 |            0 |               6 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 | 2 uops issued          |
|     5 |            0 |               7 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     6 |            0 |               8 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     7 |            0 |               9 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     8 |            0 |              10 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|     9 |            0 |              11 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    10 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    11 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    12 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    13 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
|    14 |            0 |              13 |      0 |      0 |        0 |        0 |      0 |      1 |        0 |             3 |                 2 |                        |
|    15 |            0 |              14 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             3 |                 2 | 2 uops dispatched      |
|    16 |            0 |              15 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             4 |                 2 |                        |
|    17 |            0 |              16 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 | 2 uops executed        |
|    18 |            0 |              17 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    19 |            0 |              18 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    20 |            0 |              19 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    21 |            0 |              20 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    22 |            0 |              21 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
|    23 |            0 |              22 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 5 |                        |
|    24 |            0 |              23 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 | 4 uops issued          |
|    25 |            0 |              24 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    26 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    27 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    28 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    29 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    30 |            0 |              25 |      0 |      1 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
|    31 |            0 |              26 |      0 |      1 |        0 |        0 |      0 |      3 |        0 |             5 |                 6 |                        |
|    32 |            0 |              27 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             6 |                 6 |                        |
|    33 |            0 |              28 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             7 |                 6 |                        |
|    34 |            0 |              29 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 | 3 uops executed        |
|    35 |            0 |              30 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    36 |            1 |              31 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 | wrmsr retired          |
|    37 |            1 |              32 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    38 |            1 |              33 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    39 |            1 |              34 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    40 |            1 |              35 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    41 |            1 |              36 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    42 |            1 |              37 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    43 |            1 |              38 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    44 |            1 |              39 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    45 |            1 |              40 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    46 |            1 |              41 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    47 |            1 |              42 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
|    48 |            1 |              43 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 | 1 uop issued           |
|    49 |            1 |              44 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
|    50 |            1 |              45 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
|    51 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
|    52 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 | 3 uops issued          |
|    53 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 |                        |
|    54 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 | port 2,3 load addr     |
|    55 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             8 |                10 |                        |
|    56 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             8 |                10 | executing load         |
|    57 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             9 |                10 |                        |
|    58 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             9 |                10 | port 4,9 store data    |
|    59 |            1 |              48 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |             9 |                10 | port 7,8 store address |
|    60 |            1 |              49 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |             9 |                10 |                        |
|    61 |            1 |              50 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 | 2 uops executed        |
|    62 |            1 |              51 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
|    63 |            1 |              52 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
|    64 |            1 |              53 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
|    65 |            1 |              54 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
+-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
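
Since every column in the table is a cumulative count, the events are easier to see as row-to-row differences. The sketch below does this for three columns of rows 47 to 56, copied from the table; the helper names are hypothetical. How a snapshot row maps to the exact cycle in which an event happened is an assumption here, so the row at which a counter first changes can be off by one from the cycle quoted in the text.

```python
# (cycle, RS-empty cycles, port 2/3 uops, uops issued), copied from rows 47-56 of the table.
rows = [
    (47, 42, 0, 6),
    (48, 43, 0, 7),
    (49, 44, 0, 7),
    (50, 45, 0, 7),
    (51, 46, 0, 7),
    (52, 46, 0, 10),
    (53, 46, 0, 10),
    (54, 46, 0, 10),
    (55, 47, 1, 10),
    (56, 47, 1, 10),
]


def increments(rows, column):
    """Map cycle -> delta for each row where the cumulative counter in `column` grew."""
    out = {}
    for prev, cur in zip(rows, rows[1:]):
        delta = cur[column] - prev[column]
        if delta:
            out[cur[0]] = delta
    return out


def rs_not_empty_rows(rows):
    """Rows where the RS-empty counter did not advance, i.e. the RS held at least one uop."""
    return [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[1] == prev[1]]


print("issued:", increments(rows, 3))
print("port 2/3:", increments(rows, 2))
print("RS not empty:", rs_not_empty_rows(rows))
```

In this excerpt the RS-empty counter stalls for three consecutive rows and resumes in the same row in which the port 2/3 counter first reads 1, which is the observation the conclusion below relies on.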

So based on the table it looks like the load uop is removed from the RS either at the same time as it is dispatched to the load port or a couple of cycles later. I did some sanity checking of the values in the table, and for the most part the counter values make sense. Two things I haven't figured out: 4 uops are issued to the RS (cycle 24) but only 3 are executed (cycle 35), and similarly 3 uops are issued at cycle 52 but only 2 are executed (cycle 61).

Thanks