While writing an optimized ftol (float-to-integer truncation) function I found some very odd behaviour in GCC 4.6.1. Let me show you the code first (for clarity I marked the differences):
fast_trunc_one, C:
int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;  /* fraction bits plus the implicit leading 1 */
    exponent = 150 - ((i >> 23) & 0xff);    /* 150 = 127 (bias) + 23 (fraction bits) */
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;          /* diff */
    } else {
        r = mantissa >> exponent;           /* diff */
    }

    return (r ^ -sign) + sign;              /* diff */
}
fast_trunc_two, C:
int fast_trunc_two(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;  /* fraction bits plus the implicit leading 1 */
    exponent = 150 - ((i >> 23) & 0xff);    /* 150 = 127 (bias) + 23 (fraction bits) */
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = (mantissa << -exponent) ^ -sign;  /* diff */
    } else {
        r = (mantissa >> exponent) ^ -sign;   /* diff */
    }

    return r + sign;                          /* diff */
}
The two functions seem the same, right? Well, GCC disagrees. After compiling with gcc -O3 -S -Wall -o test.s test.c, this is the assembly output:
fast_trunc_one, generated:
_fast_trunc_one:
LFB0:
        .cfi_startproc
        movl    4(%esp), %eax
        movl    $150, %ecx
        movl    %eax, %edx
        andl    $8388607, %edx
        sarl    $23, %eax
        orl     $8388608, %edx
        andl    $255, %eax
        subl    %eax, %ecx
        movl    %edx, %eax
        sarl    %cl, %eax
        testl   %ecx, %ecx
        js      L5
        rep
        ret
        .p2align 4,,7
L5:
        negl    %ecx
        movl    %edx, %eax
        sall    %cl, %eax
        ret
        .cfi_endproc
fast_trunc_two, generated:
_fast_trunc_two:
LFB1:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        movl    8(%esp), %eax
        movl    $150, %ecx
        movl    %eax, %ebx
        movl    %eax, %edx
        sarl    $23, %ebx
        andl    $8388607, %edx
        andl    $255, %ebx
        orl     $8388608, %edx
        andl    $-2147483648, %eax
        subl    %ebx, %ecx
        js      L9
        sarl    %cl, %edx
        movl    %eax, %ecx
        negl    %ecx
        xorl    %ecx, %edx
        addl    %edx, %eax
        popl    %ebx
        .cfi_remember_state
        .cfi_def_cfa_offset 4
        .cfi_restore 3
        ret
        .p2align 4,,7
L9:
        .cfi_restore_state
        negl    %ecx
        sall    %cl, %edx
        movl    %eax, %ecx
        negl    %ecx
        xorl    %ecx, %edx
        addl    %edx, %eax
        popl    %ebx
        .cfi_restore 3
        .cfi_def_cfa_offset 4
        ret
        .cfi_endproc
That's an extreme difference. It actually shows up in the profile too: fast_trunc_one is around 30% faster than fast_trunc_two. Now my question: what is causing this?
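A crude way to reproduce the timing comparison is a loop along the lines of the sketch below (only a sketch, not the setup that produced the 30% figure; put it in a separate file, link it against test.c, and compile everything with -O3):

benchmark sketch, C:
#include <stdio.h>
#include <string.h>
#include <time.h>

int fast_trunc_one(int i);  /* from test.c */
int fast_trunc_two(int i);

/* Same memcpy pun as above: reinterpret a float's bits as an int. */
static int float_bits(float f) {
    int bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

int main(void) {
    volatile int sink = 0;  /* keeps the results from being optimized away */
    clock_t t0, t1, t2;
    int n;

    t0 = clock();
    for (n = 1; n < 20000000; n++)
        sink ^= fast_trunc_one(float_bits(n * 0.37f));
    t1 = clock();
    for (n = 1; n < 20000000; n++)
        sink ^= fast_trunc_two(float_bits(n * 0.37f));
    t2 = clock();

    printf("one: %.2fs  two: %.2fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}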
Comments:

Compile with -S -O3 -da -fdump-tree-all. This will create many snapshots of the intermediate representation. Walk through them (they're numbered) side by side and you should be able to find the missing optimization in the first case. – zwol

Try changing int to unsigned int and see if the difference vanishes. – zwol

(r + shifted) ^ sign is not the same as r + (shifted ^ sign). I guess that's confusing the optimizer? FWIW, MSVC 2010 (16.00.40219.01) produces listings that are almost identical to each other: gist.github.com/2430454 – DCoder
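For reference, zwol's suggestion to switch to unsigned int amounts to something like the sketch below (the function name is just for illustration; note that exponent stays a plain int here, because an unsigned exponent could never be negative and the exponent < 0 branch would silently disappear):

unsigned variant sketch, C:
unsigned int fast_trunc_one_unsigned(unsigned int i) {
    unsigned int mantissa, sign, r;
    int exponent;  /* kept signed so the < 0 test still works */

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - (int)((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;  /* same shift-count caveats as the originals */
    } else {
        r = mantissa >> exponent;
    }

    return (r ^ -sign) + sign;      /* unsigned wraparound, so this is well-defined */
}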