Both are compiled with gcc without special options. (I.e. with the default of -O0: no optimization debug mode, keeping variables in memory instead of registers.)
Unlike a normal program, the version with int i,j
loop counters bottlenecks completely on store-forwarding latency, not front-end throughput or back-end execution resources or any shared resource.
This is why you never want to do real benchmarking with -O0
debug-mode: the bottlenecks are different than with normal optimization (-O2
at least, preferably -O3 -march=native
).
On Intel Sandybridge-family (including @uneven_mark's Kaby Lake CPU), store-forwarding latency is lower if the reload doesn't try to run right away after the store, but instead runs a couple cycles later. Adding a redundant assignment speeds up code when compiled without optimization and also Loop with function call faster than an empty loop both demonstrate this effect in un-optimized compiler output.
Having another hyperthread competing for front-end bandwidth apparently makes this happen some of the time.
Or maybe the static partitioning of the store buffer speeds up store-forwarding? Might be interesting to try a minimally-invasive loop running on the other core, like this:
// compile this with optimization enabled
// and run it on the HT sibling of the debug-mode nested loop
#include <immintrin.h>
int main(void) {
while(1) {
_mm_pause(); _mm_pause();
_mm_pause(); _mm_pause();
}
}
pause
blocks for about 100 cycles on Skylake, up from about 5 on earlier CPUs.
So if the benefit to store-forwarding is in uops from the other thread having to issue/execute, this loop will do less of that and the run-time will be closer to when it has a physical core in single-thread mode.
But if the benefit is just from partitioning the ROB and store buffer (which could plausibly speed up the time for a load to probe it for stores), we'd still see the full benefit.
Update: @uneven_mark tested on Kaby Lake and found that this reduced the "speedup" to ~2%, down from ~8%. So apparently competing for front-end / back-end resources was an important part of the infinite loop in stopping the other loop from reloading too soon.
Perhaps using up BOB (branch-order-buffer) slots was the main mechanism in stopping the other thread's branch uops from issueing into the out-of-order back-end. Modern x86 CPUs snapshot the RAT and other backend state to allow fast recovery when they detect branch mispredicts, allowing rollback to the mispredicted branch without waiting for it to reach retirement.
This avoids waiting for independent work before the branch, and letting out-of-order execution of it continue while recovering. But it means fewer branches can be in flight. At least fewer conditional/indirect branches? IDK if a direct jmp
would use a BOB entry; its validity is established during decode. So maybe this guess doesn't hold water.
The while(1){}
loop has no local vars in the loop so it doesn't bottleneck on store-forwarding. It's just a top: jmp top
loop that can run at 1 cycle per iteration. That's a single-uop instruction on Intel.
i5-8250U is a Kaby Lake, and (unlike Coffee Lake) still has its loop buffer (LSD) disabled by microcode like Skylake. So it can't unroll itself in the LSD/IDQ (queue feeding the issue/rename stage) and has to fetch the jmp
uop separately from the uop cache every cycle. But the IDQ does buffer that, only needing an issue/rename cycle every 4 cycles to issue a group of 4 jmp uops for that logical core.
But anyway, on SKL / KBL these two threads together more than saturate uop cache fetch bandwidth and do compete with each other that way. On a CPU with the LSD (loopback buffer) enabled (e.g. Haswell / Broadwell, or Coffee Lake and later), they wouldn't. Sandybridge/Ivybridge don't unroll tiny loops to use more of their LSD so you'd have the same effect there. I'm not sure if that's significant. Testing on Haswell or Coffee Lake would be interesting.
(An unconditional jmp
always ends a uop-cache line, and it's not a trace cache anyway so one uop-cache fetch can't give you more than one jmp
uop.)
I have to correct my confirmation from above: I compiled all programs as C++ (g++), which gave the roughly 2% difference. If I compile everything as C, I get about 8%, which is closer to OPs roughly 10%.
That's interesting, gcc -O0
and g++ -O0
do compile the loops differently. This is a quirk of the GCC's C vs. C++ front-ends feeding GCC's back-end different GIMPLE/RTL, or something like that, and -O0
not making the back-end fix the inefficiency. This is not anything fundamental about C vs. C++ or that you could expect from other compilers.
The C version still transforms to an idiomatic do{}while()
style loop with a cmp/jle
at the bottom of the loop, right after a memory-destination add. (The left pane on this Godbolt compiler explorer link). Why are loops always compiled into "do...while" style (tail jump)?
But the C++ version uses an if(break)
style of looping with the condition at the top, then the memory-destination add. Funny that separating the memory-destination add
from the cmp
reload by only one jmp
instruction makes that big a difference.
# inner loop, gcc9.2 -O0. (Actually g++ -xc but same difference)
jmp .L3
.L4: # do {
add DWORD PTR [rbp-8], 1 # j++
.L3: # loop entry point for first iteration
cmp DWORD PTR [rbp-8], 99999
jle .L4 # }while(j<=99999)
Apparently the add/cmp back to back make this version suffer more from slower store-forwarding on Skylake / Kaby/Coffee Lake
vs. this one which isn't affected as much:
# inner loop, g++9.2 -O0
.L4: # do {
cmp DWORD PTR [rbp-8], 99999
jg .L3 # if(j>99999) break
add DWORD PTR [rbp-8], 1 # j++
jmp .L4 # while(1)
.L3:
cmp [mem], imm
/ jcc might still micro and/or macro-fuse, but I forget which. IDK if that's relevant, but if the loop is more uops it can't issue as fast. Still, with the execution bottleneck of 1 iteration per 5 or 6 cycles (memory-destination add
latency), the front-end is easily going to stay ahead of the back-end even if it has to compete with another hyperthread.
performance
governor of intel_pstate and turbo boost is disabled. I edit my message to add this important information – sebastien dontneedtoknowthatperf
did not indicate any context switches and my kernel time was measured as 0,000s in each case. – walnutg++
), which gave the roughly 2% difference. If I compile everything as C, I get about 8%, which is closer to OPs roughly 10%. – walnutgcc -O0
andg++ -O0
do compile the loops differently, with the C version putting acmp/jle
at the bottom of the loop, right after a memory-destination add. (on the left pane in godbolt.org/z/Ik0Wkp). But the C++ version uses anif(break)
style of looping with the condition at the top. Funny that separating the memory-destinationadd
from thecmp
reload by only onejmp
instruction makes that big a difference. – Peter Cordes