GCC emits vastly different code using "-march=native" on similar architectures

Question

I'm working on writing an OpenCL benchmark in C. Currently, it measures the fused multiply-accumulate performance of both a CL device, and the system's processor using C code. The results are then cross checked for accuracy.

I wrote the native code to take advantage of GCC's auto vectorizer, and it works. However, I've noticed that GCC has some odd behavior with the "-march=native" flag.

This is my loop:

#define BUFFER_SIZE_SQRT 4096
#define SQUARE(n) (n * n)

#define ROUNDS_PER_ITERATION 48

static float* cpu_result_matrix(const float* a, const float* b, const float* c)
{
    float* res = aligned_alloc(16, SQUARE(BUFFER_SIZE_SQRT) * sizeof(float));

    const unsigned buff_size = SQUARE(BUFFER_SIZE_SQRT);
    const unsigned round_cnt = ROUNDS_PER_ITERATION;

    float lres;
    for(unsigned i = 0; i < buff_size; i++)
    {
        lres = 0;
        for(unsigned j = 0; j < round_cnt; j++)
        {
            lres += a[i] * ((b[i] * c[i]) + b[i]);
            lres += b[i] * ((c[i] * a[i]) + c[i]);
            lres += c[i] * ((a[i] * b[i]) + a[i]);
        }

        res[i] = lres;
    }

    return res;
}

When I compile with "-march=native -Ofast" on a Broadwell system, I get nice vectorized AVX code.

.L19:
        vmovups ymm0, YMMWORD PTR [rcx+rdx]
        mov     eax, 48
        vmovups ymm2, YMMWORD PTR [rdi+rdx]
        vaddps  ymm1, ymm0, ymm5
        vmovups ymm3, YMMWORD PTR [rsi+rdx]
        vaddps  ymm4, ymm2, ymm5
        vmulps  ymm1, ymm1, ymm2
        vfmadd132ps     ymm4, ymm1, ymm0
        vaddps  ymm1, ymm3, ymm5
        vmulps  ymm0, ymm2, ymm0
        vmulps  ymm0, ymm0, ymm1
        vfmadd132ps     ymm4, ymm0, ymm3
        vmovaps ymm1, ymm4
        vxorps  xmm0, xmm0, xmm0
        .p2align 4,,10
        .p2align 3

Compiling with the same flags on a Piledriver system emits SSE2 instructions, but no AVX instructions, even though the architecture supports it. (I'll clarify my title here by saying that Broadwell and Piledriver are nothing alike, but they both support similar vector instruction set extensions, so the emitted code should be similar.)

.L19:
        mov     eax, 48
        movups  xmm0, XMMWORD PTR [rcx+rdx]
        movups  xmm2, XMMWORD PTR [r13+0+rdx]
        movaps  xmm4, xmm0
        movaps  xmm1, xmm2
        movups  xmm3, XMMWORD PTR [rsi+rdx]
        addps   xmm4, xmm5
        addps   xmm1, xmm5
        mulps   xmm4, xmm2
        mulps   xmm1, xmm0
        mulps   xmm0, xmm2
        addps   xmm1, xmm4
        movaps  xmm4, xmm1
        mulps   xmm4, xmm3
        addps   xmm3, xmm5
        mulps   xmm0, xmm3
        addps   xmm4, xmm0
        pxor    xmm0, xmm0
        movaps  xmm1, xmm4
        .p2align 4,,10
        .p2align 3

I can even compile the whole project with -march=broadwell, and run it on the Piledriver system, and it works, with a ~100% performance gain.

I'm compiling with GCC 5.1.0, and "-ftree-vectorizer-verbose" doesn't seem to work anymore, so the compiler's behavior is quite opaque. I haven't found any information about the flag being deprecated, so I'm not sure why it doesn't work anymore, and I'd really like to figure out what GCC is doing.

The whole project is here: https://github.com/jakogut/clperf/tree/v0.1

Did you try -v to see what -march=native expands to? See -fopt-info-... in the doc. — Marc Glisse
-ftree-vectorizer-verbose was replaced by -fopt-info-vec-* — Ross Ridge

Mysticial Mysticial · Accepted Answer · 2015-06-18T18:23:26

AVX is disabled because the entire AMD Bulldozer family does not handle 256-bit AVX instructions efficiently. Internally, the execution units are only 128-bit wide. So 256-bit operations are split up thereby providing no benefit over 128-bit.

To add insult to injury, on Piledriver, there's a bug in the 256-bit store that reduces the throughput to about 1 every 17 cycles.

Your test case seems to be an anomaly. You don't have 256-bit stores in that critical loop - which avoids the bug. This (theoretically) leaves SSE on par with AVX for Piledriver.

The tie-breaker comes from the FMA3 instructions which Piledriver supports. This is probably why the AVX loop does become faster on Piledriver.

One thing you can try is -mfma4 -mtune=bdver2 and see what happens.

GCC emits vastly different code using "-march=native" on similar architectures

2 Answers