Fastest complex division in gcc versus ICC

Question

Consider this simple code:

#include <complex.h>
complex double f(complex double x, complex double y) {
  return x/y;
}

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

f:
        vmulsd  xmm4, xmm1, xmm3
        vmovapd xmm6, xmm0
        vmulsd  xmm5, xmm3, xmm3
        vmulsd  xmm6, xmm6, xmm3
        vfmadd231sd     xmm4, xmm0, xmm2
        vfmadd231sd     xmm5, xmm2, xmm2
        vfmsub132sd     xmm1, xmm6, xmm2
        vdivsd  xmm0, xmm4, xmm5
        vdivsd  xmm1, xmm1, xmm5
        ret

This makes sense and is easy to understand. However, the Intel C Compiler gives:

f:
        fld1                                                    #3.12
        vmovsd    QWORD PTR [-24+rsp], xmm2                     #3.12
        fld       QWORD PTR [-24+rsp]                           #3.12
        vmovsd    QWORD PTR [-24+rsp], xmm3                     #3.12
        fld       st(0)                                         #3.12
        fmul      st, st(1)                                     #3.12
        fld       QWORD PTR [-24+rsp]                           #3.12
        fld       st(0)                                         #3.12
        fmul      st, st(1)                                     #3.12
        vmovsd    QWORD PTR [-24+rsp], xmm0                     #3.12
        faddp     st(2), st                                     #3.12
        fxch      st(1)                                         #3.12
        fdivp     st(3), st                                     #3.12
        fld       QWORD PTR [-24+rsp]                           #3.12
        vmovsd    QWORD PTR [-24+rsp], xmm1                     #3.12
        fld       st(0)                                         #3.12
        fmul      st, st(3)                                     #3.12
        fxch      st(1)                                         #3.12
        fmul      st, st(2)                                     #3.12
        fld       QWORD PTR [-24+rsp]                           #3.12
        fld       st(0)                                         #3.12
        fmulp     st(4), st                                     #3.12
        fxch      st(3)                                         #3.12
        faddp     st(2), st                                     #3.12
        fxch      st(1)                                         #3.12
        fmul      st, st(4)                                     #3.12
        fstp      QWORD PTR [-16+rsp]                           #3.12
        fxch      st(2)                                         #3.12
        fmulp     st(1), st                                     #3.12
        vmovsd    xmm0, QWORD PTR [-16+rsp]                     #3.12
        fsubrp    st(1), st                                     #3.12
        fmulp     st(1), st                                     #3.12
        fstp      QWORD PTR [-16+rsp]                           #3.12
        vmovsd    xmm1, QWORD PTR [-16+rsp]                     #3.12
        ret

Can anyone explain what it is doing and whether it is in fact faster than gcc's approach?

I can't benchmark the code myself as I don't have ICC. The ICC assembly was generated using https://godbolt.org/g/ZXZGy2 .

Can't you do a benchmark yourself? Call the function a million times, measuring each call using a high-precision timer and then take the average. — Some programmer dude
Why not ask the compiler vendor? Intel will be happy to improve their compiler. — too honest for this site
@Olaf You mean contact them to ask them to perform benchmarks to report if their assembly is faster than gcc's? I am not sure they would answer that. — eleanora
Interestingly, there is a single fdivp in Intel's code, which could be beneficial, as divisions are costly. — Yves Daoust

Pyves · Accepted Answer · 2017-06-20T22:04:12

As requested by the question and some comments, I ran a quick benchmark to compare the performance of the GCC and ICC compilers on this bit of C code.

Hardware setup

The machine used to run the tests has an AMD A8-5550M quad-core APU clocked at 2.1 GHz. Cache sizes are 16K for L1i, 64K for L1d and 2048K for L2.

Experimental setup

I don't own a copy of the ICC compiler, so the assembly code listed in the question was used directly for this benchmark. The two assembly outputs were assembled with NASM. Some minor syntactic changes were required to make the ICC version compatible, but nothing that changes the functionality or affects the performance in any way. A small C wrapper was written to call the two assembly functions and record the timings.

Here is a version of the code similar to the one that was used for this simple benchmark:

#include <stdio.h>
#include <complex.h>
#include <time.h>

/* Assembly routines from the question, assembled separately with NASM. */
extern complex double gcc_f(complex double x, complex double y);
extern complex double icc_f(complex double x, complex double y);

int main(void) {
    struct timespec stop, start;
    complex double z1 = 1.0654575 + 3.0678788768 * I;
    complex double z2 = 2.225 - 8.0 * I;

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (int i = 0; i < 1000000000; ++i) {
        icc_f(z1, z2);      /* swap with the line below to time the GCC version */
        // gcc_f(z1, z2);
    }
    clock_gettime(CLOCK_MONOTONIC_RAW, &stop);

    /* Total elapsed time in nanoseconds (signed 64-bit, so print with %lld). */
    long long elapsed_ns = (stop.tv_sec - start.tv_sec) * 1000000000LL
                         + (stop.tv_nsec - start.tv_nsec);
    printf("Execution took %lldns\n", elapsed_ns);
    return 0;
}

Results

Both timings were averaged over a billion executions.

The GCC version took on average 8.8ns per execution.

The ICC version took on average 17.3ns per execution.
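The per-call figures come from dividing the total elapsed time by the number of iterations. A minimal sketch of that arithmetic follows; the helper name ns_per_call is hypothetical and was not part of the original wrapper.

```c
#include <time.h>

/* Hypothetical helper (not in the original wrapper): mean time per call,
   in nanoseconds, between two timestamps taken around the timed loop. */
double ns_per_call(struct timespec start, struct timespec stop, long iterations) {
    double elapsed_ns = (double)(stop.tv_sec - start.tv_sec) * 1e9
                      + (double)(stop.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

With a billion iterations, a total of 8.8 seconds corresponds to 8.8 ns per call.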

Therefore, the GCC compiler outperforms the ICC compiler by a significant margin, at least with the particular hardware setup described above. GCC seems to make better use of the AVX instruction set in this case, whereas the ICC output at -O3 relies on x87 floating-point instructions.
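To make the difference between the two listings in the question concrete, here is a rough C rendering of what each one appears to compute, based on my reading of the assembly. The function names are made up for illustration. The second function assumes that long double maps to the 80-bit x87 format, which is typical on x86 but not guaranteed elsewhere. Neither function guards against overflow or underflow in the denominator, which matches the naive formula that -ffast-math permits.

```c
#include <complex.h>

/* Reading of the GCC listing: one shared denominator, two divisions. */
double complex div_two_divides(double complex x, double complex y) {
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);
    double den = c * c + d * d;          /* vmulsd + vfmadd231sd */
    double re = (a * c + b * d) / den;   /* first vdivsd */
    double im = (b * c - a * d) / den;   /* second vdivsd */
    return re + im * I;
}

/* Reading of the ICC -O3 listing: a single division produces the
   reciprocal of the denominator, followed by two multiplications.
   Intermediates are kept in long double (assumed to be x87 extended
   precision) and rounded to double only when stored. */
double complex div_one_reciprocal(double complex x, double complex y) {
    long double a = creal(x), b = cimag(x);
    long double c = creal(y), d = cimag(y);
    long double r = 1.0L / (c * c + d * d);     /* fld1 ... fdivp */
    double re = (double)((a * c + b * d) * r);  /* fmul, then fstp */
    double im = (double)((b * c - a * d) * r);  /* fsubrp, fmulp, fstp */
    return re + im * I;
}
```

For x = 1 + 2i and y = 3 - 4i, both functions should return approximately -0.2 + 0.4i. The ICC variant trades one division for extra multiplications and for moves between the SSE and x87 register files through the stack, which is the kind of overhead the timings above would be sensitive to.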

As a side note, quite interestingly, if you compile with -Ofast instead of -O3, the ICC version looks more similar to the GCC version:

f:
        vunpcklpd xmm4, xmm2, xmm3                              #2.54
        vunpcklpd xmm6, xmm0, xmm1                              #2.54
        vunpckhpd xmm5, xmm4, xmm4                              #3.12
        vmulpd    xmm10, xmm4, xmm4                             #3.12
        vmulpd    xmm8, xmm5, xmm6                              #3.12
        vmovddup  xmm9, xmm4                                    #3.12
        vshufpd   xmm7, xmm6, xmm6, 1                           #3.12
        vshufpd   xmm11, xmm10, xmm10, 1                        #3.12
        vfmaddsub213pd xmm9, xmm7, xmm8                         #3.12
        vaddpd    xmm13, xmm10, xmm11                           #3.12
        vshufpd   xmm12, xmm9, xmm9, 1                          #3.12
        vdivpd    xmm0, xmm12, xmm13                            #3.12
        vunpckhpd xmm1, xmm0, xmm0                              #3.12
        ret

This alternative ICC version is significantly faster, at 9.0ns per execution on average, but it is still slightly behind the GCC version. A difference this small, however, is probably an artifact of the experimental setup.
