I played on Godbolt to see how x86-64 gcc (6.3) compiles the following code:
#include <cstdint>

typedef __int128_t int128_t;
typedef __uint128_t uint128_t;

uint128_t mul_to_128(uint64_t x, uint64_t y) {
    return uint128_t(x) * uint128_t(y);
}
uint128_t mul(uint128_t x, uint128_t y) {
    return x * y;
}
uint128_t div(uint128_t x, uint128_t y) {
    return x / y;
}
and I got:
mul_to_128(unsigned long, unsigned long):
    mov rax, rdi
    mul rsi
    ret
mul(unsigned __int128, unsigned __int128):
    imul rsi, rdx
    mov rax, rdi
    imul rcx, rdi
    mul rdx
    add rcx, rsi
    add rdx, rcx
    ret
div(unsigned __int128, unsigned __int128):
    sub rsp, 8
    call __udivti3 //what is this???
    add rsp, 8
    ret
Three questions:

- The 1st function (cast 64-bit uints to 128-bit, then multiply them) is much simpler than the multiplication of two 128-bit uints (2nd function): basically just one multiplication. If you multiply the two maximum 64-bit uints, the result definitely overflows a 64-bit register... How does it produce a 128-bit result with just one 64-bit-by-64-bit multiplication???
- I cannot really read the second result... My guess is to break each 64-bit number into two 32-bit numbers (say, `hi` as the higher 4 bytes and `lo` as the lower 4 bytes), and assemble the result like `(hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32 + (lo1*lo2)`. Apparently I was wrong... because it uses only 3 multiplications (2 of them are even `imul`... signed multiplication??? why???). Can anyone tell me what gcc is thinking? And is it optimal?
- I cannot even understand the assembly of the division... push stack -> call something called `__udivti3`, then pop stack... Is `__udivti3` something big (like a table look-up)? And what stuff does gcc try to push before the call?
the godbolt link: https://godbolt.org/g/sIIaM3
Look up `mul` in the instruction set reference, then everything is clear. – fuz