Because modern processors make heavy use of pipelining, even in the ALUs, several independent arithmetic operations can be in flight at the same time: for example, four independent add operations finish in roughly 4 cycles (one issued per cycle), not 4 × the latency of a single add.
Despite the pipelines, and despite contention on the execution ports, I would like to implement cycle-accurate delays by executing instructions in a way that makes the time to execute the sequence predictable. For example, if instruction x takes 2 cycles and cannot be pipelined, then by executing x four times I expect to get an 8-cycle delay.
I know this is usually impossible from userspace because the kernel can intervene in the middle of the sequence and add more delay than expected. However, assume that this code runs on the kernel side with interrupts disabled, or on an isolated core that is free from such noise.
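One common way to get a predictable time even on a pipelined core is to force a dependency chain, so each instruction has to wait for the previous result. A minimal sketch of that idea (the function name, the choice of imul, and its assumed 3-cycle latency from Agner Fog's tables are my own illustration, not something from the question):

static inline void delay_dependent_chain(void)
{
    unsigned long x = 1;
    asm volatile(
        "imul $1, %0, %0\n\t"   /* each imul must wait for the previous one */
        "imul $1, %0, %0\n\t"
        "imul $1, %0, %0\n\t"
        "imul $1, %0, %0\n\t"   /* 4 chained imuls ~ 4 * 3 = 12 cycles */
        : "+r"(x));
}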
After taking a look at https://agner.org/optimize/instruction_tables.pdf, I found that the CDQ instruction doesn't touch memory and has a latency and reciprocal throughput of 1 cycle. If I understand this correctly, it means that as long as there is no contention for the port CDQ uses, the core can execute one CDQ every cycle. To test it, I put CDQ between two RDTSC reads and set the core frequency to the nominal frequency (hoping that this matches the TSC frequency). I also pinned two processes to the two hyperthreads of one core: one spins in a while(1) loop and the other executes the CDQ instructions. It seems that adding one instruction increases the measurement by 1-2 TSC cycles.
However, I am concerned about the case where a large delay, say 10000 cycles, requires lots of CDQ instructions, at least 5000 of them. If the code becomes too large to fit in the instruction cache, the resulting cache and TLB misses might introduce jitter into my delay. I've tried using a simple for loop to execute the CDQ instructions (see the sketch below), but I cannot be sure whether the loop is okay, because its overhead (implemented with sub, cmp, and jnz) might also add unexpected noise to the delay. Could anyone confirm whether I can use the CDQ instruction in this way?
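For reference, the loop variant I mean looks roughly like the sketch below; the function name, the unroll factor of four, and using sub/jnz instead of a separate cmp are my own choices, and the loop-control instructions are exactly the overhead I am worried about:

static inline void delay_cdq_loop(unsigned iterations)
{
    unsigned dummy = 0;                 /* value CDQ sign-extends from EAX */
    asm volatile(
        "1:\n\t"
        "cdq\n\t"
        "cdq\n\t"
        "cdq\n\t"
        "cdq\n\t"
        "sub $1, %[cnt]\n\t"            /* loop overhead: sub + jnz */
        "jnz 1b"
        : [cnt] "+r"(iterations)
        : "a"(dummy)
        : "edx", "cc");
}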
Added Question
After testing with multiple CMC instructions, it seems that 10 CMC instructions add 10 TSC cycles. I used the code below to measure the time to execute 0, 10, 20, 30, 40, and 50 of them:
asm volatile(
    "lfence\n\t"            // finish earlier instructions before reading the TSC
    "rdtsc\n\t"             // start timestamp in EDX:EAX
    "lfence\n\t"            // keep the timed block after rdtsc
    "mov %%eax, %%esi\n\t"  // save the low 32 bits of the start time
    "cmc\n\t"               // CMC * 10, 20, 30, 40, ...
    "rdtscp\n\t"            // end timestamp in EDX:EAX (also writes ECX)
    "lfence\n\t"            // don't let later instructions start early
    "sub %%esi, %%eax\n\t"  // elapsed = end_low - start_low
    : "=a"(*res)
    :
    : "ecx", "edx", "esi", "r11"
);
printf("elapsed time:%d\n", *res);
I got 44-46, 50-52, 62-64, 70-72, 80-82, 90-92 for no CMC, 10, 20, 30, 40, and 50 CMC instructions. Given that the RDTSC measurement varies by 0-2 TSC cycles from run to run, it looks like one CMC instruction maps to one cycle of latency. Except for the first step of adding 10 CMC (which adds only 6-8, not 10), adding 10 more CMC instructions adds (10 ± 2) more TSC cycles most of the time. However, when I changed CMC to the CDQ instruction I originally used in the question, one CDQ instruction does not map to one cycle on my i9-9900K machine, even though CMC and CDQ look essentially the same in Agner's tables. Is this because back-to-back CMC instructions depend on each other (each one reads the CF written by the previous one), while back-to-back CDQ instructions have no dependency between them?
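One way to test that hypothesis is sketched below (my own illustration, not part of the measurements above): interleaving each CDQ with an XOR that feeds EDX back into EAX forces every CDQ to wait for the previous pair, so if the dependency explanation is right, this version should scale with the instruction count, while plain back-to-back CDQs overlap.

static inline void cdq_forced_chain(void)
{
    unsigned seed = 0;
    asm volatile(
        "cdq\n\t"                /* sign-extend EAX into EDX                */
        "xor %%edx, %%eax\n\t"   /* next CDQ now depends on this result     */
        "cdq\n\t"
        "xor %%edx, %%eax\n\t"
        "cdq\n\t"
        "xor %%edx, %%eax\n\t"   /* repeat the pair as needed */
        : "+a"(seed)
        :
        : "edx", "cc");
}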
Also, if we assume the remaining variation is caused by RDTSC itself rather than by interrupts or other contention, then it seems that one CMC instruction can be used to delay exactly one core cycle, right? I pinned my core to run at 3.6 GHz, which I assume is the same as the TSC frequency on the i9-9900K. I did take a look at the referenced question, but I couldn't catch the exact details.
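If that conclusion holds, a delay primitive based on it could look like the sketch below; the macro name and the use of the assembler's .rept directive to avoid any loop overhead are my own, and it only works on the assumption that the CMC chain really does retire one instruction per cycle and nothing interrupts it:

#define DELAY_CYCLES(n)                      \
    asm volatile(                            \
        ".rept " #n "\n\t"  /* emit n back-to-back CMC instructions */ \
        "cmc\n\t"                            \
        ".endr"                              \
        ::: "cc")

/* usage: DELAY_CYCLES(10000); expands to 10000 cmc instructions, so code
 * size grows with the delay, which is the I-cache concern raised above. */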
cmc (toggle CF) might do the trick. But that's barely useful for your overall goal of a delay loop. Possibly combined with lfence, but that could delay for much longer than you want depending on how long any existing in-flight instructions take, e.g. a cache-miss load. – Peter Cordes