42
votes

This is my test code:

#include <chrono>
#include <iostream>
#include <cstdlib>
using namespace std;

using ll = long long;

int main()
{
    __int128_t a, b;
    ll x, y;

    a = rand() + 10000000;
    b = rand() % 50000;
    auto t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        a += b;
        a /= b;
        b *= a;
        b -= a;
        a %= b;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)a % 100000 << '\n';

    x = rand() + 10000000;
    y = rand() % 50000;
    t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        x += y;
        x /= y;
        y *= x;
        y -= x;
        x %= y;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)x % 100000 << '\n';

    return 0;
}

This is the test result:

$ g++ main.cpp -o main -O2
$ ./main
2432 1
2627 1

Using GCC 10.1.0 on x64 GNU/Linux, no matter if it's using optimization of -O2 or un-optimized, __int128_t is always a little faster than long long.

int and double are both significantly faster than long long; long long has become the slowest type.

How does this happen?

2
I think it is irrelevant to long long. If you define x and y as __int128_t you will also get such difference godbolt.org/z/1e1YeE - Alex Lop.
To what degree can out of order execution affect the results here? At a glance the two tests look completely independent of each other in which case isn't the processor free to execute them out of order? Asking to test my potentially naive understanding of the subject. - Rich
@Rich OOO won't execute two loops in parallel and probably because of the dependencies inside the loop code the OOO won't be very efficient here. - Alex Lop.
@Rich: hardware OoO exec only works over short distances, where "short" is at most about 224 instructions on Skylake (the ROB size: blog.stuffedcow.net/2013/05/measuring-rob-capacity). And that's measured along the path of execution, where each trip through the loop runs the loop body. See my answer here. Fusing the two loops would only be theoretically possible for an unconventional CPU like Transmeta Crusoe that internally does dynamic recompilation, not for current CPUs that look at instructions in execution order. - Peter Cordes
But yeah, this craptastic benchmark doesn't do any warm-ups, so the only thing that saves it from CPU frequency and other warm-up effects totally throwing it off is that it runs a lot of iterations so that's a drop in the bucket. Idiomatic way of performance evaluation?. Also, it puts huge emphasis on division performance by doing it as much as other operations. Very unrealistic for most use-cases. - Peter Cordes

2 Answers

38
votes

The performance difference comes from the efficiency of 128-bit divisions/modulus with GCC/Clang in this specific case.

Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, multiplications and subtractions are slower with __int128_t. But, built-in functions for divisions/modulus on 16-byte types (__divti3 and __modti3 on x86 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).

If we look deeper in the implementation of GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4). Intel processors can execute the code faster because:

  • taken branches can be well predicted in this case since they are always the same (and also because the loop is executed millions of times);
  • the division/modulus is split in faster native instructions that can mostly be executed in parallel by multiple CPU ports (and that benefit from out-of-order execution). A div instruction is still used in most possible paths (especially in this case);
  • The execution time of the div/idiv instructions covers most of the overall execution time because of their very high latencies. The div/idiv instructions cannot be executed in parallel because of the loop dependencies. However, the latency of a div lower than a idiv making the former faster.

Please note that the performance of the two implementations can highly differ from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/througput of the idiv instruction). Indeed, the latency of a 64-bit idiv instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors for example. Respectively the latency of a div is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instructions/branch costs in the 128-bits case).

28
votes

TL:DR: __int128 division helper functions internally end up doing an unsigned div reg64 (after some branching on the values being positive and the upper halves being 0). 64-bit div is faster on Intel CPUs than the signed idiv reg64 that GCC inlines for signed long long. Faster by enough to make up for all the extra overhead of the helper function, and extended-precision for the other operations.

You probably wouldn't see this effect on AMD CPUs: long long would be faster as expected because idiv r64 is similar enough in perf to div r64 there.

And unsigned long long is faster than unsigned __int128 even on Intel CPUs, for example on my i7-6700k (Skylake) at 3.9GHz (run under perf stat to be sure of CPU frequency during the test):

  • 2097 (i128) vs. 2332 (i64) - your original test (run back-to-back for CPU freq warm-up)
  • 2075 (u128) vs. 1900 (u64) - unsigned versions. Slightly less branching in u128 division vs. i128, but major difference for i64 vs. u64 where the only difference is div vs. idiv.

Also, drawing any general conclusions from a very specific micro-benchmark like this would be a bad idea. It is interesting to dig into why exactly the extended-precision __int128 type manages to be faster in this division benchmark with positive numbers small enough to fit in a 32-bit integer, though.


Your benchmark is heavily weighted towards division, which you do twice per iteration (/ and %), even though it's much more expensive than other operations and in most code used much less often. (e.g. sum a whole array then divide once to get the average.)

Your benchmark is also has no instruction-level parallelism: each step has a data dependency on the previous step. This prevents auto-vectorization or anything that would show some of the advantages of narrower types.

(It's also not careful to avoid warm-up effects like the first timed region being slow until the CPU gets up to max turbo. Idiomatic way of performance evaluation?. But that happens much faster than the couple seconds of your timed regions, so that's not a problem here.)

128-bit integer division (especially signed) is too complicated for GCC to want to inline, so gcc emits a call to a helper function, __divti3 or __modti3. (TI = tetra-integer, GCC's internal name for an integer that's 4x the size of int.) These functions are documented in the GCC-internals manual.

You can see the compiler-generated asm on the Godbolt compiler-explorer. i.e. 128-bit addition with add/adc, multiplication with one mul full-multiply of the low halves, and 2x non-widening imul of the cross products. Yes these are slower than the single-instruction equivalents for int64_t.

But Godbolt doesn't show you the asm for libgcc helper functions. It doesn't disassemble them even in "compile-to-binary" and disassemble mode (instead of the usual compiler asm text output) because it dynamically links libgcc_s instead of libgcc.a.

Extended-precision signed division is done by negating if necessary and doing unsigned division of 64-bit chunks, then fixing up the sign of the result if necessary.

With both inputs small and positive, no actual negation is needed (just testing and branching). There are also fast-paths for small numbers (high-half divisor = 0, and quotient will fit in 64 bits), which is the case here. The end result is that the path of execution through __divti3 looks like this:

This is from manually single-stepping into the call to __divti3 with gdb, after compiling with g++ -g -O3 int128-bench.cpp -o int128-bench.O3 on my Arch GNU/Linux system, with gcc-libs 10.1.0-2.

# Inputs: dividend = RSI:RDI, divisor = RCX:RDX
# returns signed quotient RDX:RAX
|  >0x7ffff7c4fd40 <__divti3>       endbr64             # in case caller was using CFE (control-flow enforcement), apparently this instruction has to pollute all library functions now.  I assume it's cheap at least in the no-CFE case.
│   0x7ffff7c4fd44 <__divti3+4>     push   r12
│   0x7ffff7c4fd46 <__divti3+6>     mov    r11,rdi
│   0x7ffff7c4fd49 <__divti3+9>     mov    rax,rdx                                                                                                       │   0x7ffff7c4fd4c <__divti3+12>    xor    edi,edi
│   0x7ffff7c4fd4e <__divti3+14>    push   rbx
│   0x7ffff7c4fd4f <__divti3+15>    mov    rdx,rcx
│   0x7ffff7c4fd52 <__divti3+18>    test   rsi,rsi      # check sign bit of dividend (and jump over a negation)
│   0x7ffff7c4fd55 <__divti3+21>    jns    0x7ffff7c4fd6e <__divti3+46>
... taken branch to
|  >0x7ffff7c4fd6e <__divti3+46>    mov    r10,rdx
│   0x7ffff7c4fd71 <__divti3+49>    test   rdx,rdx      # check sign bit of divisor (and jump over a negation), note there was a mov rdx,rcx earlier
│   0x7ffff7c4fd74 <__divti3+52>    jns    0x7ffff7c4fd86 <__divti3+70>
... taken branch to
│  >0x7ffff7c4fd86 <__divti3+70>    mov    r9,rax
│   0x7ffff7c4fd89 <__divti3+73>    mov    r8,r11
│   0x7ffff7c4fd8c <__divti3+76>    test   r10,r10      # check high half of abs(divisor) for being non-zero
│   0x7ffff7c4fd8f <__divti3+79>    jne    0x7ffff7c4fdb0 <__divti3+112>  # falls through: small-number fast path
│   0x7ffff7c4fd91 <__divti3+81>    cmp    rax,rsi      # check that quotient will fit in 64 bits so 128b/64b single div won't fault: jump if (divisor <= high half of dividend)
│   0x7ffff7c4fd94 <__divti3+84>    jbe    0x7ffff7c4fe00 <__divti3+192>  # falls through: small-number fast path
│   0x7ffff7c4fd96 <__divti3+86>    mov    rdx,rsi
│   0x7ffff7c4fd99 <__divti3+89>    mov    rax,r11
│   0x7ffff7c4fd9c <__divti3+92>    xor    esi,esi
│  >0x7ffff7c4fd9e <__divti3+94>    div    r9                #### Do the actual division ###
│   0x7ffff7c4fda1 <__divti3+97>    mov    rcx,rax
│   0x7ffff7c4fda4 <__divti3+100>   jmp    0x7ffff7c4fdb9 <__divti3+121>
...taken branch to
│  >0x7ffff7c4fdb9 <__divti3+121>   mov    rax,rcx
│   0x7ffff7c4fdbc <__divti3+124>   mov    rdx,rsi
│   0x7ffff7c4fdbf <__divti3+127>   test   rdi,rdi     # check if the result should be negative
│   0x7ffff7c4fdc2 <__divti3+130>   je     0x7ffff7c4fdce <__divti3+142>
... taken branch over a neg rax / adc rax,0 / neg rdx
│  >0x7ffff7c4fdce <__divti3+142>   pop    rbx
│   0x7ffff7c4fdcf <__divti3+143>   pop    r12
│   0x7ffff7c4fdd1 <__divti3+145>   ret
... return back to the loop body that called it

Intel CPUs (since IvyBridge) have zero-latency mov, so all that overhead doesn't worsen the critical path latency (which is your bottleneck) significantly. Or at least not enough to make up the difference between idiv and div.

The branching is handled by branch prediction and speculative execution, only checking the predictions after the fact when the actual input register values are the same. The branching goes the same way every time so it's trivial for branch prediction to learn. Since division is so slow, there's plenty of time for out-of-order exec to catch up.

64-bit operand-size integer division is very slow on Intel CPUs, even when the numbers are actually small and would fit in a 32-bit integer, and the extra microcode for signed integer division is even more expensive.

e.g. on my Skylake (i7-6700k), https://uops.info/ shows that (table search result )

  • idiv r64 is 56 uops for the front-end, with latency from 41 to 95 cycles (from divisor to quotient, which is the relevant case here I think).
  • div r64 is 33 uops for the front-end, with latency from 35 to 87 cycles. (for that same latency path).

The latency best-case happens for small quotients or small dividends or something, I can never remember which.

Similar to the branching that GCC does in software for 128-bit division in terms of 64-bit, I think the CPU microcode is internally doing 64-bit division in terms of narrower operations, probably the 32-bit that's only 10 uops for signed or unsigned, with much lower latency. (Ice Lake improves the divider so 64-bit division isn't much slower than 32-bit.)

This is why you found long long so much slower than int for this benchmark. In a lot of cases it's about the same, or half speed if memory bandwidth or SIMD are involved. (Only 2 elements per 128-bit of vector width, not 4).

AMD CPUs handle 64-bit operand size more efficiently, with performance only depending on the actual values, so about the same for div r32 vs. div r64 with the same numbers.

BTW, the actual values tend to be something like a=1814246614 / b=1814246613 = 1, then a=1 % b=1814246612 (with b decreasing by 1 every iteration). Only ever testing division with quotient=1 seems very silly. (The first iteration might be different, but we get into this state for the 2nd and later.)

Performance of integer operations other than division is not data-dependent on modern CPUs. (Unless of course there are compile-time constants that allow different asm to be emitted. Like division by a constant is much cheaper when done with a multiplicative inverse calculated at compile time.)

re: double: see Floating point division vs floating point multiplication for division vs. multiplication. FP division is often harder to avoid, and its performance is relevant in more cases, so it's handled better.


Related: