Why does GCC generate such radically different assembly for nearly the same C code?

Question

While writing an optimized ftol function I found some very odd behaviour in GCC 4.6.1. Let me show you the code first (for clarity I marked the differences):

fast_trunc_one, C:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;                       /* diff */
    } else {
        r = mantissa >> exponent;                        /* diff */
    }

    return (r ^ -sign) + sign;                           /* diff */
}

fast_trunc_two, C:

int fast_trunc_two(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = (mantissa << -exponent) ^ -sign;             /* diff */
    } else {
        r = (mantissa >> exponent) ^ -sign;              /* diff */
    }

    return r + sign;                                     /* diff */
}

Seems the same right? Well GCC disagrees. After compiling with gcc -O3 -S -Wall -o test.s test.c this is the assembly output:

fast_trunc_one, generated:

_fast_trunc_one:
LFB0:
    .cfi_startproc
    movl    4(%esp), %eax
    movl    $150, %ecx
    movl    %eax, %edx
    andl    $8388607, %edx
    sarl    $23, %eax
    orl $8388608, %edx
    andl    $255, %eax
    subl    %eax, %ecx
    movl    %edx, %eax
    sarl    %cl, %eax
    testl   %ecx, %ecx
    js  L5
    rep
    ret
    .p2align 4,,7
L5:
    negl    %ecx
    movl    %edx, %eax
    sall    %cl, %eax
    ret
    .cfi_endproc

fast_trunc_two, generated:

_fast_trunc_two:
LFB1:
    .cfi_startproc
    pushl   %ebx
    .cfi_def_cfa_offset 8
    .cfi_offset 3, -8
    movl    8(%esp), %eax
    movl    $150, %ecx
    movl    %eax, %ebx
    movl    %eax, %edx
    sarl    $23, %ebx
    andl    $8388607, %edx
    andl    $255, %ebx
    orl $8388608, %edx
    andl    $-2147483648, %eax
    subl    %ebx, %ecx
    js  L9
    sarl    %cl, %edx
    movl    %eax, %ecx
    negl    %ecx
    xorl    %ecx, %edx
    addl    %edx, %eax
    popl    %ebx
    .cfi_remember_state
    .cfi_def_cfa_offset 4
    .cfi_restore 3
    ret
    .p2align 4,,7
L9:
    .cfi_restore_state
    negl    %ecx
    sall    %cl, %edx
    movl    %eax, %ecx
    negl    %ecx
    xorl    %ecx, %edx
    addl    %edx, %eax
    popl    %ebx
    .cfi_restore 3
    .cfi_def_cfa_offset 4
    ret
    .cfi_endproc

That's an extreme difference. This actually shows up on the profile too, fast_trunc_one is around 30% faster than fast_trunc_two. Now my question: what is causing this?

For testing purposes I created a gist here where you can easily copy/paste the source and see if you can reproduce the bug on other systems/versions of GCC. — orlp
Put the test cases in a directory of their own. Compile them with -S -O3 -da -fdump-tree-all. This will create many snapshots of the intermediate representation. Walk through them (they're numbered) side by side and you should be able to find the missing optimization in the first case. — zwol
Suggestion two: change all int to unsigned int and see if the difference vanishes. — zwol
The two functions seem to be doing slightly different math. While the results might be the same, the expression (r + shifted) ^ sign is not the same as r + (shifted ^ sign). I guess that's confusing the optimizer? FWIW, MSVC 2010 (16.00.40219.01) produces listings that are almost identical to each other: gist.github.com/2430454 — DCoder
@DCoder: Oh damn! I didn't spot that. It isn't the explanation for the difference though. Let me update the question with a new version where this is ruled out. — orlp

Mysticial Mysticial · Accepted Answer · 2012-04-20T17:52:51

Updated to sync with the OP's edit

By tinkering with the code, I've managed to see how GCC optimizes the first case.

Before we can understand why they are so different, first we must understand how GCC optimizes fast_trunc_one().

Believe it or not, fast_trunc_one() is being optimized to this:

int fast_trunc_one(int i) {
    int mantissa, exponent;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);

    if (exponent < 0) {
        return (mantissa << -exponent);             /* diff */
    } else {
        return (mantissa >> exponent);              /* diff */
    }
}

This produces the exact same assembly as the original fast_trunc_one() - register names and everything.

Notice that there are no xors in the assembly for fast_trunc_one(). That's what gave it away for me.

How so?

Step 1: sign = -sign

First, let's take a look at the sign variable. Since sign = i & 0x80000000;, there are only two possible values that sign can take:

sign = 0
sign = 0x80000000

Now recognize that in both cases, sign == -sign. Therefore, when I change the original code to this:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;
    } else {
        r = mantissa >> exponent;
    }

    return (r ^ sign) + sign;
}

It produces the exact same assembly as the original fast_trunc_one(). I'll spare you the assembly, but it is identical - register names and all.

Step 2: Mathematical reduction: x + (y ^ x) = y

sign can only take one of two values, 0 or 0x80000000.

When x = 0, then x + (y ^ x) = y then trivial holds.
Adding and xoring by 0x80000000 is the same. It flips the sign bit. Therefore x + (y ^ x) = y also holds when x = 0x80000000.

Therefore, x + (y ^ x) reduces to y. And the code simplifies to this:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = (mantissa << -exponent);
    } else {
        r = (mantissa >> exponent);
    }

    return r;
}

Again, this compiles to the exact same assembly - register names and all.

This above version finally reduces to this:

int fast_trunc_one(int i) {
    int mantissa, exponent;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);

    if (exponent < 0) {
        return (mantissa << -exponent);             /* diff */
    } else {
        return (mantissa >> exponent);              /* diff */
    }
}

which is pretty much exactly what GCC generates in the assembly.

So why doesn't the compiler optimize fast_trunc_two() to the same thing?

The key part in fast_trunc_one() is the x + (y ^ x) = y optimization. In fast_trunc_two() the x + (y ^ x) expression is being split across the branch.

I suspect that might be enough to confuse GCC to not make this optimization. (It would need to hoist the ^ -sign out of the branch and merge it into the r + sign at the end.)

For example, this produces the same assembly as fast_trunc_one():

int fast_trunc_two(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = ((mantissa << -exponent) ^ -sign) + sign;             /* diff */
    } else {
        r = ((mantissa >> exponent) ^ -sign) + sign;              /* diff */
    }

    return r;                                     /* diff */
}

Why does GCC generate such radically different assembly for nearly the same C code?

3 Answers

How so?