I played on Godbolt to see how x86-64 gcc (6.3) compiles the following code:
#include <cstdint>

typedef __int128_t int128_t;
typedef __uint128_t uint128_t;

uint128_t mul_to_128(uint64_t x, uint64_t y) {
    return uint128_t(x) * uint128_t(y);
}
uint128_t mul(uint128_t x, uint128_t y) {
    return x * y;
}
uint128_t div(uint128_t x, uint128_t y) {
    return x / y;
}
and I got:
mul_to_128(unsigned long, unsigned long):
    mov rax, rdi
    mul rsi
    ret
mul(unsigned __int128, unsigned __int128):
    imul rsi, rdx
    mov rax, rdi
    imul rcx, rdi
    mul rdx
    add rcx, rsi
    add rdx, rcx
    ret
div(unsigned __int128, unsigned __int128):
    sub rsp, 8
    call __udivti3 //what is this???
    add rsp, 8
    ret
Three questions:

- The 1st function (cast 64-bit uints to 128-bit, then multiply them) is much simpler than the multiplication of two 128-bit uints (2nd function): basically just one multiplication. If you multiply the two maximum 64-bit uints, the result definitely overflows a 64-bit register... How does it produce a 128-bit result with just one 64-bit-by-64-bit multiplication???
- I cannot really read the second result... My guess is to break each 64-bit number into two 32-bit numbers (say, `hi` as the higher 4 bytes and `lo` as the lower 4 bytes), and assemble the result like `(hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32 + (lo1*lo2)`. Apparently I was wrong... because it uses only 3 multiplications (2 of them are even `imul`... signed multiplication??? why???). Can anyone tell me what gcc is thinking? And is it optimal?
- I cannot even understand the assembly of the division... push stack -> call something called `__udivti3`, then pop stack... Is `__udivti3` something big (like a table look-up)? And what stuff does gcc try to push before the call?
the godbolt link: https://godbolt.org/g/sIIaM3
Look up `mul` in the instruction set reference, then everything is clear. – fuz