0
votes

EDIT - something's up with my build system. I'm still figuring out exactly what, but gcc was producing weird results (even though it's a .cpp file); once I switched to g++ it worked as expected.


This is a very reduced test-case for something I've been having trouble with, where using a numerical wrapper class (which I thought would be inlined away) made my program 10x slower.

This is independent of optimisation level (tried with -O0 and -O3).

Am I missing some detail in my wrapper class?


C++

I have the following program, in which I define a class which wraps a double and provides the + operator:

#include <cstdio>
#include <cstdlib>

#define INLINE __attribute__((always_inline)) inline

struct alignas(8) WrappedDouble {
    double value;

    INLINE friend const WrappedDouble operator+(const WrappedDouble& left, const WrappedDouble& right) {
        return {left.value + right.value};
    };
};

#define doubleType WrappedDouble // either "double" or "WrappedDouble"

int main() {
    int N = 100000000;
    doubleType* arr = (doubleType*)malloc(sizeof(doubleType)*N);
    for (int i = 1; i < N; i++) {
        arr[i] = arr[i - 1] + arr[i];
    }

    free(arr);
    printf("done\n");

    return 0;
}

I thought that this would compile to the same thing - it's doing the same calculations, and everything is inlined.

However, it's not - it produces a larger and slower result, regardless of optimisation level.

(This particular result is not significantly slower, but my actual use-case includes more arithmetic.)

EDIT - I am aware that this isn't constructing my array elements. I thought this might produce less ASM so I could understand it better, but I can change it if it's a problem.

EDIT - I am also aware that I should be using new[]/delete[]. Unfortunately gcc refused to compile that, even though it was in a .cpp file. This was a symptom of my build system being screwed up, which is probably my actual problem.
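For reference, here is a sketch of the same test using std::vector (as the comments below suggest), assuming the same WrappedDouble definition and doubleType macro as above. The vector value-initializes its elements to 0.0, so there is no uninitialized read:

#include <cstdio>
#include <vector>

int main() {
    const int N = 100000000;
    std::vector<doubleType> arr(N);     // elements value-initialized to 0.0
    for (int i = 1; i < N; i++) {
        arr[i] = arr[i - 1] + arr[i];   // same prefix-sum loop as before
    }
    printf("done\n");
    return 0;
}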

EDIT - If I use g++ instead of gcc, it produces identical output.


EDIT - I posted the wrong version of the ASM (-O0 instead of -O3), so this section isn't helpful.

Assembly

I'm using XCode's gcc on my Mac, on a 64-bit system. The generated assembly is the same for both versions, apart from the body of the for-loop.

Here's what it produces for the body of the loop if doubleType is double:

movq    -16(%rbp), %rax
movl    -20(%rbp), %ecx
subl    $1, %ecx
movslq  %ecx, %rdx
movsd   (%rax,%rdx,8), %xmm0    ## xmm0 = mem[0],zero
movq    -16(%rbp), %rax
movslq  -20(%rbp), %rdx
addsd   (%rax,%rdx,8), %xmm0
movq    -16(%rbp), %rax
movslq  -20(%rbp), %rdx
movsd   %xmm0, (%rax,%rdx,8)

The WrappedDouble version is much longer:

movq    -40(%rbp), %rax
movl    -44(%rbp), %ecx
subl    $1, %ecx
movslq  %ecx, %rdx
shlq    $3, %rdx
addq    %rdx, %rax
movq    -40(%rbp), %rdx
movslq  -44(%rbp), %rsi
shlq    $3, %rsi
addq    %rsi, %rdx
movq    %rax, -16(%rbp)
movq    %rdx, -24(%rbp)
movq    -16(%rbp), %rax
movsd   (%rax), %xmm0           ## xmm0 = mem[0],zero
movq    -24(%rbp), %rax
addsd   (%rax), %xmm0
movsd   %xmm0, -8(%rbp)
movsd   -8(%rbp), %xmm0         ## xmm0 = mem[0],zero
movsd   %xmm0, -56(%rbp)
movq    -40(%rbp), %rax
movslq  -44(%rbp), %rdx
movq    -56(%rbp), %rsi
movq    %rsi, (%rax,%rdx,8)
You should almost never use malloc in C++. It only allocates memory, but it doesn't construct objects. And almost never use new[] to allocate arrays, use std::vector instead. - Some programmer dude
and a goto label... - Matthieu Brucher
You have UB currently as reading uninitialized variables... - Jarod42
Using a vector and the -O2 flag, using a double or your class compiles to the same code (with GCC 8.2). Note that removing the INLINE macro or using "proper" type-aliases didn't change anything. - Some programmer dude
IIRC, gcc on a .cpp file will compile it as C++, but since you used the gcc front-end it won't link the C++ standard library. So you'll get a link error if you use new instead of malloc. There's no good reason to ever use gcc on C++ code AFAIK, that's just what happens if you do so by accident. Of course you probably have a gcc that's actually Apple clang, but probably the behaviour is the same. - Peter Cordes

3 Answers

6
votes

It is inlined, but not optimized away because you compiled with -O0 (the default). That generates asm for consistent debugging, allowing you to modify any C++ variable while stopped at a breakpoint on any line.

This means the compiler spills everything from registers after every statement, and reloads what it needs for the next. So more statements to express the same logic = slower code, whether they're in the same function or not. Why does clang produce inefficient asm for this simple floating point sum (with -O0)? explains in more detail.

Normally -O0 won't inline functions, but it does respect __attribute__((always_inline)).

C loop optimization help for final assignment explains why benchmarking or tuning with -O0 is totally pointless. Both versions are ridiculous garbage for performance.


If it wasn't inlined, there'd be a call instruction that called it inside the loop.

The asm is actually creating the pointers in registers for const WrappedDouble& left and right. (very inefficiently, using multiple instructions instead of one lea. The addq %rdx, %rax is the final step in one of those.)

Then it spills those pointer args to stack memory, because they're real variables and have to be in memory where a debugger could modify them. That's what movq %rax, -16(%rbp) and %rdx ... is doing.

After reloading and dereferencing those pointers, the addsd (add scalar double) result is itself spilled back to a local in stack memory with movsd %xmm0, -8(%rbp). This isn't a named variable, it's the return value of the function.

It's then reloaded and copied again to another stack location, then finally arr and i are loaded from the stack, along with the double result of operator+, and that's stored into arr[i] with movq %rsi, (%rax,%rdx,8). (Yes, LLVM used a 64-bit integer mov to copy a double that time. The earlier times used SSE2 movsd.)
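Roughly speaking, the un-optimized wrapper version behaves as if it were written like this (a sketch of the behaviour, not literal compiler output; the stack offsets refer to the asm above):

// Every reference parameter and the return value gets its own stack slot.
const WrappedDouble* left  = &arr[i - 1];              // address built, spilled to -16(%rbp)
const WrappedDouble* right = &arr[i];                  // spilled to -24(%rbp)
WrappedDouble retval = {left->value + right->value};   // addsd, spilled to -8(%rbp)
WrappedDouble tmp = retval;                            // reloaded and copied again
arr[i] = tmp;                                          // arr and i reloaded, then stored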

All of those copies of the return value are on the critical path for the loop-carried dependency chain, because the next iteration reads arr[i-1]. Those ~5 or 6 cycle store-forwarding latencies really add up vs. 3 or 4 cycle FP add latency.


Obviously that's massively inefficient. With optimization enabled, gcc and clang have no trouble inlining and optimizing away your wrapper.

They also optimize by keeping around the arr[i] result in a register for use as the arr[i-1] result in the next iteration. This avoids the ~6 cycle store-forwarding latency that would otherwise be inside the loop, if it made asm like the source.

i.e. the optimized asm looks kind of like this C++:

double tmp = arr[0];   // kept in XMM0

for(...) {
   tmp += arr[i];   // no re-read of memory
   arr[i] = tmp;
}

Amusingly, clang doesn't bother to initialize its tmp (xmm0) before the loop, because you don't bother to initialize the array. Strange it doesn't warn about UB. In practice a big malloc with glibc's implementation will give you fresh pages from the OS, and they will all hold zeros, i.e. 0.0. But clang will give you whatever was left around in XMM0! If you add a ((double*)arr)[0] = 1;, clang will load the first element before the loop.
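If you want to keep the malloc-style allocation but remove the uninitialized read entirely, one option (just a sketch) is calloc, which hands back zeroed memory; on IEEE-754 targets an all-zero bit pattern is 0.0:

// Zero-initialized allocation: every element starts as 0.0, so reading
// arr[i - 1] on the first iteration is no longer garbage.
doubleType* arr = (doubleType*)calloc(N, sizeof(doubleType));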

Unfortunately the compiler doesn't know how to do any better than that for your Prefix Sum calculation. See parallel prefix (cumulative) sum with SSE and SIMD prefix sum on Intel cpu for ways to speed this up by another factor of maybe 2, and/or parallelize it.
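For reference, here is a minimal sketch of the in-register idea from those links, using SSE2 intrinsics on a plain double array rather than the wrapper (the function name and the assumption that n is even are mine, purely for illustration):

#include <emmintrin.h>
#include <cstddef>

// Inclusive prefix sum, two doubles per iteration.
void prefix_sum_sse2(double* a, std::size_t n) {
    __m128d carry = _mm_setzero_pd();                    // running total in both lanes
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d x = _mm_loadu_pd(&a[i]);                 // [a0, a1]
        __m128d shifted = _mm_unpacklo_pd(_mm_setzero_pd(), x);  // [0.0, a0]
        x = _mm_add_pd(x, shifted);                      // [a0, a0+a1]
        x = _mm_add_pd(x, carry);                        // add running total
        _mm_storeu_pd(&a[i], x);                         // write both prefix sums
        carry = _mm_unpackhi_pd(x, x);                   // broadcast new running total
    }
}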

I prefer Intel syntax, but the Godbolt compiler explorer can give you AT&T syntax like in your question if you like.

# gcc8.2 -O3 -march=haswell -Wall
.LC0:
    .string "done"
main:
    sub     rsp, 8
    mov     edi, 800000000
    call    malloc                  # return value in RAX

    vmovsd  xmm0, QWORD PTR [rax]   # load first element
    lea     rdx, [rax+8]            # p = &arr[1]
    lea     rcx, [rax+800000000]    # endp = arr + len

.L2:                                   # do {
    vaddsd  xmm0, xmm0, QWORD PTR [rdx]   # tmp += *p
    add     rdx, 8                        # p++
    vmovsd  QWORD PTR [rdx-8], xmm0       # p[-1] = tmp
    cmp     rdx, rcx
    jne     .L2                        # }while(p != endp);

    mov     rdi, rax
    call    free
    mov     edi, OFFSET FLAT:.LC0
    call    puts
    xor     eax, eax
    add     rsp, 8
    ret

Clang unrolls a bit, and like I said doesn't bother to init its tmp.

# just the inner loop from clang -O3
# with -march=haswell it unrolls a lot more, so I left that out.
# hence the 2-operand SSE2 addsd instead of 3-operand AVX vaddsd
.LBB0_1:                                # do {
    addsd   xmm0, qword ptr [rax + 8*rcx - 16]
    movsd   qword ptr [rax + 8*rcx - 16], xmm0
    addsd   xmm0, qword ptr [rax + 8*rcx - 8]
    movsd   qword ptr [rax + 8*rcx - 8], xmm0
    addsd   xmm0, qword ptr [rax + 8*rcx]
    movsd   qword ptr [rax + 8*rcx], xmm0
    add     rcx, 3                            # i += 3
    cmp     rcx, 100000002
    jne     .LBB0_1                   # }while(i != 100000002)

Apple XCode's gcc is really clang/LLVM in disguise, on modern OS X systems.

2
votes

Both versions result in identical assembly code with g++ and clang++ when you turn on optimizations with -O3.

1
votes

For future reference (mine and anyone else): I was seeing a few different things:

  1. The XCode project I was using originally (which I adapted but didn't create) was somehow configured so that even the Release build wasn't using -O3.

  2. Using gcc for C++ code is a bad idea. Even when compiling a .cpp file, it doesn't link to the standard library by default. Using g++ is much smoother.

  3. The most interesting (to me): even when the wrapper was being inlined correctly, it disrupted some optimisations!

The third point was what caused the slowdown in my original code (not listed here) which led me down this path.

When you are adding a bunch of floating-point values, e.g. a + b + c + d, the compiler isn't allowed to re-associate the additions involving c or d, because (since floating-point results are rounded) that might produce a subtly different result. However, it is allowed to swap a and b, because that first addition is commutative - and in my case, this let it use SIMD instructions on 64-bit builds.

However, when the wrapper was used, the compiler didn't carry over the fact that the first + is commutative! It dutifully inlined everything away, but somehow didn't realise it was still allowed to swap the first two arguments. When I re-ordered the sums manually in the appropriate way, my two versions got equal performance.
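To make that concrete, here is a sketch (the variable names are hypothetical, not from my real code) of what the compiler may and may not do with a chain of additions, without -ffast-math:

// The written expression fixes the association: ((a + b) + c) + d.
double s1 = a + b + c + d;
// Swapping the operands of the *first* add is exact, so this is fair game:
double s2 = b + a + c + d;       // always bit-identical to s1
// Re-associating is NOT automatically allowed, because rounding can differ:
double s3 = (a + b) + (c + d);   // may differ from s1 in the last bit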