delays and measurement of specific instructions

Question

Because modern processor makes use of heavy pipeline even for ALU, multiple executions of independent arithmetic operations can be executed in one cycle, for example, four add operations can be executed in 4 cycles not 4 * latency of one add.

Even the existence of the pipelines, and the presence of contention on the execution ports, I would like to implement cycle-accurate delays by executing some instructions in a way that time to execute a sequence of instructions is predictable. For example, if instruction x takes 2 cycles, and cannot be pipelined, then by executing x four-time, I expect that I can put 8 cycle delays.

I know that this can be usually impossible for userspace because the kernel can intervene in between execution sequence and could result in more delay then expectation. However, I assume that this code executes in the kernel side without interrupts or isolated core which is free from noise.

After taking a look at https://agner.org/optimize/instruction_tables.pdf, I found that CDQ instruction doesn't require memory operation and takes 1 cycle in its latency and reciprocal throughput. If I understand this correctly, this means that if there is no contention for the port used by CDQ, it can execute this instruction at every cycle. To test it, I put the CDQ in between RDTSC timer and set core frequency as nominal core frequency (with the hope that it is the same as TSC cycle). Also I pinned two processes to hyperthreaded cores; one falls in the while(1) loop and the other executes CDQ instruction. It seems that adding one instruction increases 1-2 TSC cycles.

However, I am concern about the case when it requires lots of CDQ instructions to put large delays such as 10000 which might require at least 5000 instructions. If the code size is too large to fit in the Instruction cache and cause cache miss and TLB miss, it might introduce some jitters in my delay. I've tried to use simple for loop to execute CDQ instructions, but cannot assure whether it is okay to use for loop (implemented with jnz,cmp, and sub) because it might also introduce some unexpected noise in my delay. Could anyone confirm if I can use the CDQ instruction in this way?

Added Question

After testing with multiple CMC instructions, it seems that 10 CMC instruction adds 10 TSC cycles. I used below code to measure time for executing 0, 10, 20, 30, 40, 50

    asm volatile(                                                                                                                                                                                                                                                                               
        "lfence\t\n"                                                                                                                                                                                                                                                                            
        "rdtsc\t\n"                                                                                                                                                                                                                                                                             
        "lfence\t\n"                                                                                                                                                                                                                                                                            
        "mov %%eax, %%esi\t\n"
                                                                                                                                                                                                                                                                                                
        "cmc\n\t" // CMC * 10, 20, 30, 40, ...
                                                                                                                                                                                                                                                                                                
        "rdtscp\n\t"                                                                                                                                                                                                                                                                            
        "lfence\t\n"                                                                                                                                                                                                                                                                            
        "sub %%esi, %%eax\t\n"
        :"=a"(*res)
        :
        : "ecx","edx","esi", "r11"
    );

    printf("elapsed time:%d\n", *res);

I got 44-46, 50-52, 62-64, 70-72, 80-82, 90-92 for (no CMC, 10CMC, 20CMC, 30CMC, 40CMC, 50CMC). When the RDTSC results varies 0~2 TSC cycles at every execution, it seems that 1CMC instruction maps to 1cycle latency. Except for the first time of adding 10 CMC (it doesn't increase 10 but 6~8), most of the time adding 10 more CMC instructions add (10 +-2)more TSC cylces. However, when I changed CMC to CDQ instruction as I originally used in the question, it seems that 1 CDQ instruction doesn't map to 1cycle in i9900K machine. However, when I look at the agner's optimization table, it seems that CMC and CDQ instruction is not different really. Does it because CMC instructions back to back have not a dependency on each other but CDQ instructions do have a dependency in between them?

Also if we consider the variable latency has been caused by the rdtsc not because of interrupt or other contention issues.. then it seems that CMC instruction can be used for delaying 1 core cycle right? Because I pinned my core to run at 3.6GHz clock frequency which assumed to be a TSC clock frequency on i9900k.. I did take a look at the referenced question but cannot catch the exact details..

Other than chips like the non-mips PIC microcontrollers and other with similar situations, you cannot do this outside of lets say a test environment. In sim sure, with controlled fetches or perfectly repeatable execution where you can hand tune it, but in the real world with even one motherboard but certainly across the x86 world, not possible to do this consistently. Not something to spend time attempting. You want accurate timing you use a peripheral to do it, that is designed to do the thing you want to do. — old_timer
In software you can often do the "at least this long" I can make it no less than this amount of time but it could be much longer. which is useful for working with various peripherals, or other things, but not for accurate no more than and no less than timing. — old_timer
If for some reason you're looking for a one-byte instruction that definitely has a true dependency on itself (limiting execution to a latency bottleneck of 1/clock), cmc (toggle CF) might do the trick. But that's barely useful for your overall goal of a delay loop. Possibly combined with lfence, but that could delay for much longer than you want depending on how long any existing in-flight instructions take. e.g. a cache-miss load. — Peter Cordes
@JaehyukLee: For "extremely small and accurate" delays the only option is to give up then reassess the reason why you made the mistake of thinking you wanted it. For kernel code; for longer delays with slightly less accuracy you could look into using local APIC timer in "TSC deadline mode" (possibly with some adjustment for IRQ exit timing) and/or similar with performance monitoring counters. — Brendan

Peter Cordes Peter Cordes · Accepted Answer · 2020-09-23T22:05:03

You have 4 main options:

delay the 2nd operation by giving it a data dependency on (the result of) the first.
lfence, fixed delay sequence, lfence. Both of these can only give a minimum delay; could be much longer depending on CPU frequency scaling and/or interrupts.
spin on rdtsc until a deadline (which you calculate somehow, e.g. based on an earlier rdtsc), or do a longer sleep based on a TSC deadline e.g. using the local APIC.
Give up and use a different design, or use an in-order microcontroller where you can get reliable cycle-accurate timing at a fixed clock frequency.

This may be an X-Y problem, or at least isn't solvable without getting into the specific details of the two things you want to separate with a delay. (e.g. create a data dependency between a load and a store-address, and lengthen that dep chain with some instructions). There is no general-case answer that works between arbitrary code for very short delays.

If you need accurate delays of only a few clock cycles, you're mostly screwed; superscalar out-of-order execution, interrupts, and variable clock frequency makes that essentially impossible in the general case. As @Brendan explained:

For "extremely small and accurate" delays the only option is to give up then reassess the reason why you made the mistake of thinking you wanted it.

For kernel code; for longer delays with slightly less accuracy you could look into using local APIC timer in "TSC deadline mode" (possibly with some adjustment for IRQ exit timing) and/or similar with performance monitoring counters.

For delays of several dozen clock cycles, spin-wait for RDTSC to have a value you're looking for. How to calculate time for an asm delay loop on x86 linux? But that has some minimum overhead to execute RDTSC twice, or RDTSC plus TPAUSE if you have the "waitpkg" ISA extension. (You don't on i9-9900k). You also need lfence if you want to stop out-of-order exec across the whole thing.

If you need to do something "every 20 ns" or something, then increment a deadline instead of trying to do a fixed delay between other work. So variation in the other work won't accumulate error. But one interrupt will put you far behind and lead to running your other work back-to-back until you catch up. So as well as checking for the deadline, you'd also want to check for being far behind the deadline and take a new TSC sample.

(The TSC ticks at constant frequency on modern x86, but the core clock doesn't: see How to get the CPU cycle count in x86_64 from C++? for more details)

Maybe you can use a data dependency between your real work?

Small delays of a few clock cycles, smaller than the out-of-order scheduler size¹, are not really possible without taking the surrounding code into consideration and knowing the exact microarchitecture you're executing on.

footnote 1: 97 entry RS on Skylake-derived uarches, although there's some evidence that it's not truly a unified scheduler: some entries can only hold some kinds of uops.

If you can create a data dependency between the two things you're trying to separate, you might be able to create a minimum delay between their execution that way. There are ways to couple a dependency chain into another register without affecting its value, e.g. and eax, 0 / or ecx, eax makes ECX depend on the instruction that wrote EAX without affecting the value of ECX. (Make a register depend on another one without changing its value).

e.g. between two loads, you could create a data dependency from the load result of one into the load address of the later load, or into a store address. Coupling two store addresses together with a dependency chain is less good; the first store could take a bunch of extra time (e.g. for a dTLB miss) after the address is known, so two stores end up committing back-to-back after all. You might need mfence then lfence between two stores if you want to put a delay before the 2nd store. See also Are loads and stores the only instructions that gets reordered? for more about OoO exec across lfence (and mfence on Skylake).

This may require writing your "real work" in asm, too, unless you can come up with a way to "launder" the data dependency from the compiler with a small inline asm statement.

CMC is one of the few single-byte instructions available in 64-bit mode that you can just repeat to create a latency bottleneck (1 cycle per instruction on most CPUs) without also accessing memory (like lodsb which bottlenecks on merging into the low byte of RAX). xchg eax, reg would also work, but that's 3 uops on Intel.

Instead of lfence, you could couple that dep chain into a specific instruction using adc reg, 0, if you start with a known CF state and use an odd or even number of CMC instructions such that CF=0 at that point. Or cmovc same,same would make a register value depend on CF without modifying it, regardless of whether CF was set or cleared.

However, single-byte instructions can create weird front-end effects when you have too many in a row for the uop cache to handle. That's what slows down CDQ if you repeat it indefinitely; apparently Skylake can only decode it at 1/clock in the legacy decoders. Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?. That may be ok and/or what you want. 3 cycles per 3-byte instruction would let this code be cached by the uop cache, e.g imul eax, eax or imul eax, 0. But maybe it's better to avoid polluting the uop cache with code that's supposed to run slowly.

Between LFENCE instructions, cld is 3 uops and has a 4c throughput on Skylake, so if you're using lfence at the start/end of your delay that could be usable.

Also of course, any dead-reckoning delay in terms of a certain number of some instructions (not rdtsc) will depend on the core clock frequency, not the reference frequency. And at best it's a minimum delay; if an interrupt comes in during your delay loop, the total delay will be close to the total of interrupt handling time plus whatever your delay-loop took.

Or if the CPU happens to be running at idle speed (often 800MHz), the delay in nanoseconds will be much longer than if the CPU is at max turbo.

Re: your 2nd experiment with CMC between lfence OoO exec barriers

Yes, you can pretty accurately control the core clock cycles between two lfence instructions, or between lfence and rdtscp, with a simple dependency chain, pause instruction, or a throughput bottleneck on some execution unit(s), possibly the integer or FP divider. But I assume your real use case cares about the total delay between stuff before the first lfence and stuff after the 2nd lfence.

The first lfence has to wait for whatever instructions were previously in flight to retire from the out-of-order back-end (ROB = reorder buffer, 224 fused-domain uops on Skylake-family). If those included any loads that might miss in cache, your wait time can vary tremendously, and be much longer than you probably want.

Is it because CMC instructions back to back have no dependency on each other but CDQ instructions do have a dependency in between them?

You have that backwards: CMC has a true dependency on the previous CMC because it reads and writes the carry flag. Just like not eax has a true dependency on the previous EAX value.

CDQ does not: it reads EAX and writes EDX. Register renaming makes it possible for RDX to be written more than once in the same clock cycle. e.g. Zen can run 4 cdq instructions per clock. Your Coffee Lake can run 2 CDQ per clock (0.5c throughput), bottlenecked on the back-end ports it can run on (p0 and p6).

Agner Fog's numbers were based on testing a huge block of repeated instruction, apparently bottlenecking on legacy-decode throughput of 1/clock. (Again, see Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions? ). https://uops.info/ numbers are closer to accurate for small repeat counts for Coffee Lake, showing it as 0.6 c throughput. (But if you look at the detailed breakdown, with an unroll count of 500 https://www.uops.info/html-tp/CFL/CDQ-Measurements.html confirms that Coffee Lake still has that front-end bottleneck).

But increasing the repeat count up past about 20 (if aligned) will lead to the same legacy-decode bottleneck that Agner saw. However, if you don't use lfence, decode could be far ahead of execution so this is not good.

CDQ is a poor choice because of the weird front-end effects, and/or being a back-end throughput bottleneck instead of latency. But OoO exec can still see around it once the front-end gets past the repeated CDQs. 1-byte NOP could create a front-end bottleneck which might be more usable depending on what two things you were trying to separate.

BTW, if you don't fully understand dependency chains and their implications for out-of-order execution, and probably a bunch of other cpu-architecture details about the exact CPU you're using (e.g. store buffers if you want to separate any stores), you're going to have a bad time trying to do anything meaningful.

If you can do what you need with just a data dependency between two things, that might reduce the amount of stuff you need to understand to make anything like what you described as your goal.

Otherwise you probably need to understand basically all of this answer (and Agner Fog's microarchitecture guide) to figure out how your real problem translates into something you can actually make a CPU do. Or realize that it can't, and you'll need something else. (Like maybe a very fast in-order CPU, perhaps ARM, where you can somewhat control timing between independent instructions with delay sequences / loops.)

delays and measurement of specific instructions

Added Question

1 Answers

Maybe you can use a data dependency between your real work?

Re: your 2nd experiment with CMC between lfence OoO exec barriers