4 votes

I have an Intel CPU with 4 HT cores (8 logical CPUs) and I built two simple processes.

The first one:

int main()
{
  for(int i=0;i<1000000;++i)
    for(int j=0;j<100000;++j);
}

The second one:

int main()
{
  while(1);
}

Both are compiled with gcc without special options. (I.e. with the default -O0: no-optimization debug mode, which keeps variables in memory instead of registers.)

When I run the first one on the first logical CPU (CPU0), while the other logical CPUs have a load near 0%, the execution time of this first process is:

real    2m42,625s
user    2m42,485s
sys     0m0,070s

However, when I run the second process (the infinite loop) on CPU4 (CPU0 and CPU4 are on the same core but not on the same hardware thread), the execution time of the first process is:

real    2m25,412s
user    2m25,291s
sys     0m0,047s

I expected a longer time, since there are now two processes running on the same core instead of only one. But it is actually faster. Why does this happen?
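
(The pinning mechanism isn't specified above; taskset -c 0 / taskset -c 4 on the command line works, or in code something like the following rough sketch with sched_setaffinity, assuming Linux. The exact method shouldn't matter.)

/* Sketch only: pin the current process to one logical CPU before looping.
   Assumes Linux and sched_setaffinity; equivalent to launching with taskset -c 0. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* logical CPU 0; use 4 for its HT sibling */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }

    for (int i = 0; i < 1000000; ++i)    /* the busy loop from the question */
        for (int j = 0; j < 100000; ++j)
            ;
    return 0;
}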

EDIT: The P-states driver is intel_pstate. C-states are fixed with the kernel parameters processor.max_cstate=1 intel_idle.max_cstate=0. The frequency governor is set to performance (cpupower frequency-set -g performance) and turbo is disabled (cat /sys/devices/system/cpu/intel_pstate/no_turbo gives 1).

Yes. The CPU frequency is fixed via the performance governor of intel_pstate, and turbo boost is disabled. I edited my message to add this important information. – sebastien dontneedtoknowthat
I can confirm this happens on an i5-8250U, although not to that extreme. I get about a 2% improvement with the second program running on the same core, but no improvement when the second program runs on a different core. It must be an effect internal to the CPU, probably something to do with the details of pipelining. perf did not indicate any context switches, and my kernel time was measured as 0,000s in each case. – walnut
I have to correct my confirmation from above: I compiled all programs as C++ (g++), which gave the roughly 2% difference. If I compile everything as C, I get about 8%, which is closer to the OP's roughly 10%. – walnut
@uneven_mark: oh weird, gcc -O0 and g++ -O0 do compile the loops differently, with the C version putting a cmp/jle at the bottom of the loop, right after a memory-destination add (left pane in godbolt.org/z/Ik0Wkp). But the C++ version uses an if(break) style of looping with the condition at the top. Funny that separating the memory-destination add from the cmp reload by only one jmp instruction makes that big a difference. – Peter Cordes

1 Answer

5 votes

Both are compiled with gcc without special options. (I.e. with the default -O0: no-optimization debug mode, which keeps variables in memory instead of registers.)

Unlike a normal program, the version with int i,j loop counters bottlenecks completely on store-forwarding latency, not front-end throughput or back-end execution resources or any shared resource.

This is why you never want to do real benchmarking with -O0 debug mode: the bottlenecks are different from what they are with normal optimization (-O2 at least, preferably -O3 -march=native).
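
(If you do want to time a loop like this with optimization enabled, one common idiom is an empty GNU C inline-asm statement as an optimization barrier, so the work can't be optimized away but the counter still lives in a register. A sketch, not anything from the question itself:)

// Sketch only (assumes GNU C inline asm): keep the loop from being optimized
// away at -O2 without forcing the counter into memory the way -O0 does.
int main(void)
{
    for (int i = 0; i < 1000000; ++i)
        for (int j = 0; j < 100000; ++j)
            asm volatile("" : "+r"(j));   // compiler must materialize j in a register every iteration
    return 0;
}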


On Intel Sandybridge-family (including @uneven_mark's Kaby Lake CPU), store-forwarding latency is lower if the reload doesn't try to run right away after the store, but instead runs a couple of cycles later. The Q&As Adding a redundant assignment speeds up code when compiled without optimization and Loop with function call faster than an empty loop both demonstrate this effect in un-optimized compiler output.

Having another hyperthread competing for front-end bandwidth apparently makes this happen some of the time.

Or maybe the static partitioning of the store buffer speeds up store-forwarding? Might be interesting to try a minimally-invasive loop running on the other logical core, like this:

// compile this with optimization enabled
// and run it on the HT sibling of the debug-mode nested loop
#include <immintrin.h>

int main(void) {
    while(1) {
      _mm_pause(); _mm_pause();
      _mm_pause(); _mm_pause();
    }
}

pause blocks for about 100 cycles on Skylake, up from about 5 on earlier CPUs.

So if the benefit to store-forwarding comes from the other thread having uops that need to issue/execute, this loop will do much less of that, and the run-time will be closer to what it is when the first loop has a physical core to itself in single-thread mode.

But if the benefit is just from partitioning the ROB and store buffer (which could plausibly speed up the time for a load to probe it for stores), we'd still see the full benefit.
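
(As a rough sanity check of that pause number on a given CPU, something like this sketch works, assuming GNU C and __rdtsc from <x86intrin.h>; RDTSC counts reference cycles, which with turbo disabled as in the question is close to core clock cycles.)

// Rough sketch: time a large batch of pause instructions with RDTSC.
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    const unsigned long long n = 10000000ULL;        // 10M iterations
    unsigned long long start = __rdtsc();
    for (unsigned long long i = 0; i < n; ++i)
        _mm_pause();
    unsigned long long end = __rdtsc();
    printf("~%.1f reference cycles per pause (including loop overhead)\n",
           (double)(end - start) / n);
    return 0;
}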

Update: @uneven_mark tested the pause loop on Kaby Lake and found that it reduced the "speedup" to ~2%, down from ~8%. So apparently competing for front-end / back-end resources was an important part of how the infinite loop stopped the other loop from reloading too soon.

Perhaps using up BOB (branch-order-buffer) slots was the main mechanism in stopping the other thread's branch uops from issuing into the out-of-order back-end. Modern x86 CPUs snapshot the RAT and other back-end state to allow fast recovery when they detect branch mispredicts, allowing rollback to the mispredicted branch without waiting for it to reach retirement.

This avoids waiting for independent work from before the branch to finish, and lets out-of-order execution of it continue while recovering. But it means fewer branches can be in flight. At least fewer conditional/indirect branches? IDK if a direct jmp would use a BOB entry; its validity is established during decode. So maybe this guess doesn't hold water.


The while(1){} loop has no local vars in the loop so it doesn't bottleneck on store-forwarding. It's just a top: jmp top loop that can run at 1 cycle per iteration. That's a single-uop instruction on Intel.
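
(For reference, gcc -O0 compiles the while(1); loop to literally a label and a self-jump, something like:)

# while(1);  at gcc -O0: the whole loop
.L2:
        jmp     .L2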

i5-8250U is a Kaby Lake, and (unlike Coffee Lake) still has its loop buffer (LSD) disabled by microcode like Skylake. So it can't unroll itself in the LSD/IDQ (queue feeding the issue/rename stage) and has to fetch the jmp uop separately from the uop cache every cycle. But the IDQ does buffer that, only needing an issue/rename cycle every 4 cycles to issue a group of 4 jmp uops for that logical core.

But anyway, on SKL / KBL these two threads together more than saturate uop cache fetch bandwidth and do compete with each other that way. On a CPU with the LSD (loopback buffer) enabled (e.g. Haswell / Broadwell, or Coffee Lake and later), they wouldn't. Sandybridge/Ivybridge don't unroll tiny loops to use more of their LSD so you'd have the same effect there. I'm not sure if that's significant. Testing on Haswell or Coffee Lake would be interesting.

(An unconditional jmp always ends a uop-cache line, and it's not a trace cache anyway so one uop-cache fetch can't give you more than one jmp uop.)


I have to correct my confirmation from above: I compiled all programs as C++ (g++), which gave the roughly 2% difference. If I compile everything as C, I get about 8%, which is closer to the OP's roughly 10%.

That's interesting: gcc -O0 and g++ -O0 do compile the loops differently. This is a quirk of GCC's C vs. C++ front-ends feeding GCC's back-end different GIMPLE/RTL, or something like that, with -O0 not making the back-end fix the inefficiency. This is not anything fundamental about C vs. C++, nor something you could expect from other compilers.

The C version still transforms into an idiomatic do{}while()-style loop, with a cmp/jle at the bottom of the loop right after a memory-destination add (the left pane on godbolt.org/z/Ik0Wkp). See also Why are loops always compiled into "do...while" style (tail jump)?

But the C++ version uses an if(break) style of looping with the condition at the top, then the memory-destination add. Funny that separating the memory-destination add from the cmp reload by only one jmp instruction makes that big a difference.

# inner loop, gcc9.2 -O0.   (Actually g++ -xc but same difference)
        jmp     .L3
.L4:                                       # do {
        add     DWORD PTR [rbp-8], 1       #   j++
.L3:                                  # loop entry point for first iteration
        cmp     DWORD PTR [rbp-8], 99999
        jle     .L4                        # }while(j<=99999)

Apparently having the add and cmp back to back makes this version suffer more from the slower store-forwarding case on Skylake / Kaby Lake / Coffee Lake,

vs. this one which isn't affected as much:

# inner loop, g++9.2 -O0
.L4:                                      # do {
        cmp     DWORD PTR [rbp-8], 99999
        jg      .L3                         # if(j>99999) break
        add     DWORD PTR [rbp-8], 1        # j++
        jmp     .L4                       # while(1)
.L3:

cmp [mem], imm / jcc might still micro- and/or macro-fuse, but I forget which. IDK if that's relevant, but if the loop takes more uops it can't issue as fast. Still, with the execution bottleneck of 1 iteration per 5 or 6 cycles (memory-destination add latency), the front-end is easily going to stay ahead of the back-end even if it has to compete with another hyperthread.