Consider this simple code:
#include <complex.h>

complex double f(complex double x, complex double y) {
    return x/y;
}
With gcc 7.1 and -O3 -march=core-avx2 -ffast-math, you get:
f:
        vmulsd xmm4, xmm1, xmm3
        vmovapd xmm6, xmm0
        vmulsd xmm5, xmm3, xmm3
        vmulsd xmm6, xmm6, xmm3
        vfmadd231sd xmm4, xmm0, xmm2
        vfmadd231sd xmm5, xmm2, xmm2
        vfmsub132sd xmm1, xmm6, xmm2
        vdivsd xmm0, xmm4, xmm5
        vdivsd xmm1, xmm1, xmm5
        ret
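For reference, here is how I read the gcc output, written back as C. This is only a sketch of my interpretation: `div_textbook` is a name I made up, and I am assuming the usual calling convention where xmm0/xmm1 hold the real/imaginary parts of x and xmm2/xmm3 hold those of y. As far as I can tell it is the plain textbook formula with no scaling against overflow or underflow, which gcc presumably only inlines because of -ffast-math.

```c
#include <complex.h>

/* Hypothetical helper mirroring the gcc assembly:
   (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c + d*d) */
static double complex div_textbook(double complex x, double complex y) {
    double a = creal(x), b = cimag(x);   /* xmm0, xmm1 (assumed) */
    double c = creal(y), d = cimag(y);   /* xmm2, xmm3 (assumed) */

    double denom  = c * c + d * d;       /* vmulsd d*d, then vfmadd231sd */
    double re_num = a * c + b * d;       /* vmulsd b*d, then vfmadd231sd */
    double im_num = b * c - a * d;       /* vmulsd a*d, then vfmsub132sd */

    /* Two separate divisions by the same denominator (the two vdivsd). */
    return re_num / denom + (im_num / denom) * I;
}
```

For example, `div_textbook(4.0 + 2.0 * I, 1.0 + 1.0 * I)` works out to 3 - i, since the numerators are 6 and -2 and the denominator is 2.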
The gcc output makes sense and is easy to understand. However, the Intel C Compiler (ICC) gives:
f:
        fld1 #3.12
        vmovsd QWORD PTR [-24+rsp], xmm2 #3.12
        fld QWORD PTR [-24+rsp] #3.12
        vmovsd QWORD PTR [-24+rsp], xmm3 #3.12
        fld st(0) #3.12
        fmul st, st(1) #3.12
        fld QWORD PTR [-24+rsp] #3.12
        fld st(0) #3.12
        fmul st, st(1) #3.12
        vmovsd QWORD PTR [-24+rsp], xmm0 #3.12
        faddp st(2), st #3.12
        fxch st(1) #3.12
        fdivp st(3), st #3.12
        fld QWORD PTR [-24+rsp] #3.12
        vmovsd QWORD PTR [-24+rsp], xmm1 #3.12
        fld st(0) #3.12
        fmul st, st(3) #3.12
        fxch st(1) #3.12
        fmul st, st(2) #3.12
        fld QWORD PTR [-24+rsp] #3.12
        fld st(0) #3.12
        fmulp st(4), st #3.12
        fxch st(3) #3.12
        faddp st(2), st #3.12
        fxch st(1) #3.12
        fmul st, st(4) #3.12
        fstp QWORD PTR [-16+rsp] #3.12
        fxch st(2) #3.12
        fmulp st(1), st #3.12
        vmovsd xmm0, QWORD PTR [-16+rsp] #3.12
        fsubrp st(1), st #3.12
        fmulp st(1), st #3.12
        fstp QWORD PTR [-16+rsp] #3.12
        vmovsd xmm1, QWORD PTR [-16+rsp] #3.12
        ret
Can anyone explain what the ICC code is doing, and whether it is in fact faster than gcc's approach?
I can't benchmark the code myself as I don't have ICC. The ICC assembly was generated using https://godbolt.org/g/ZXZGy2 .