I'm seeing a big difference in performance between code compiled in MSVC (on Windows) and GCC (on Linux) for an Ivy Bridge system. The code does dense matrix multiplication. I'm getting 70% of the peak flops with GCC and only 50% with MSVC. I think I may have isolated the difference to how they both convert the following three intrinsics.
__m256 breg0 = _mm256_loadu_ps(&b[8*i])
_mm256_add_ps(_mm256_mul_ps(arge0,breg0), tmp0)
GCC does this
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps ymm9, ymm0, ymm9
vaddps ymm8, ymm8, ymm9
MSVC does this
vmulps ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps ymm3, ymm1, ymm3
Could somebody please explain to me if and why these two solutions could give such a big difference in performance?
Despite MSVC using one less instruction it ties the load to the mult and maybe that makes it more dependent (maybe the load can't be done out of order)? I mean Ivy Bridge can do one AVX load, one AVX mult, and one AVX add in one clock cycle but this requires each operation to be independent.
Maybe the problem lies elsewhere? You can see the full assembly code for GCC and MSVC for the innermost loop below. You can see the C++ code for the loop here Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
g++ -S -masm=intel matrix.cpp -O3 -mavx -fopenmp
.L4:
vbroadcastss ymm0, DWORD PTR [rcx+rdx*4]
add rdx, 1
add rax, 256
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps ymm9, ymm0, ymm9
vaddps ymm8, ymm8, ymm9
vmovups ymm9, YMMWORD PTR [rax-224]
vmulps ymm9, ymm0, ymm9
vaddps ymm7, ymm7, ymm9
vmovups ymm9, YMMWORD PTR [rax-192]
vmulps ymm9, ymm0, ymm9
vaddps ymm6, ymm6, ymm9
vmovups ymm9, YMMWORD PTR [rax-160]
vmulps ymm9, ymm0, ymm9
vaddps ymm5, ymm5, ymm9
vmovups ymm9, YMMWORD PTR [rax-128]
vmulps ymm9, ymm0, ymm9
vaddps ymm4, ymm4, ymm9
vmovups ymm9, YMMWORD PTR [rax-96]
vmulps ymm9, ymm0, ymm9
vaddps ymm3, ymm3, ymm9
vmovups ymm9, YMMWORD PTR [rax-64]
vmulps ymm9, ymm0, ymm9
vaddps ymm2, ymm2, ymm9
vmovups ymm9, YMMWORD PTR [rax-32]
cmp esi, edx
vmulps ymm0, ymm0, ymm9
vaddps ymm1, ymm1, ymm0
jg .L4
MSVC /FAc /O2 /openmp /arch:AVX ...
vbroadcastss ymm2, DWORD PTR [r10]
lea rax, QWORD PTR [rax+256]
lea r10, QWORD PTR [r10+4]
vmulps ymm1, ymm2, YMMWORD PTR [rax-320]
vaddps ymm3, ymm1, ymm3
vmulps ymm1, ymm2, YMMWORD PTR [rax-288]
vaddps ymm4, ymm1, ymm4
vmulps ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps ymm5, ymm1, ymm5
vmulps ymm1, ymm2, YMMWORD PTR [rax-224]
vaddps ymm6, ymm1, ymm6
vmulps ymm1, ymm2, YMMWORD PTR [rax-192]
vaddps ymm7, ymm1, ymm7
vmulps ymm1, ymm2, YMMWORD PTR [rax-160]
vaddps ymm8, ymm1, ymm8
vmulps ymm1, ymm2, YMMWORD PTR [rax-128]
vaddps ymm9, ymm1, ymm9
vmulps ymm1, ymm2, YMMWORD PTR [rax-96]
vaddps ymm10, ymm1, ymm10
dec rdx
jne SHORT $LL3@AddDot4x4_
EDIT:
I benchmark the code by claculating the total floating point operations as 2.0*n^3
where n is the width of the square matrix and dividing by the time measured with omp_get_wtime()
. I repeat the loop several times. In the output below I repeated it 100 times.
Output from MSVC2012 on an Intel Xeon E5 1620 (Ivy Bridge) turbo for all cores is 3.7 GHz
maximum GFLOPS = 236.8 = (8-wide SIMD) * (1 AVX mult + 1 AVX add) * (4 cores) * 3.7 GHz
n 64, 0.02 ms, GFLOPs 0.001, GFLOPs/s 23.88, error 0.000e+000, efficiency/core 40.34%, efficiency 10.08%, mem 0.05 MB
n 128, 0.05 ms, GFLOPs 0.004, GFLOPs/s 84.54, error 0.000e+000, efficiency/core 142.81%, efficiency 35.70%, mem 0.19 MB
n 192, 0.17 ms, GFLOPs 0.014, GFLOPs/s 85.45, error 0.000e+000, efficiency/core 144.34%, efficiency 36.09%, mem 0.42 MB
n 256, 0.29 ms, GFLOPs 0.034, GFLOPs/s 114.48, error 0.000e+000, efficiency/core 193.37%, efficiency 48.34%, mem 0.75 MB
n 320, 0.59 ms, GFLOPs 0.066, GFLOPs/s 110.50, error 0.000e+000, efficiency/core 186.66%, efficiency 46.67%, mem 1.17 MB
n 384, 1.39 ms, GFLOPs 0.113, GFLOPs/s 81.39, error 0.000e+000, efficiency/core 137.48%, efficiency 34.37%, mem 1.69 MB
n 448, 3.27 ms, GFLOPs 0.180, GFLOPs/s 55.01, error 0.000e+000, efficiency/core 92.92%, efficiency 23.23%, mem 2.30 MB
n 512, 3.60 ms, GFLOPs 0.268, GFLOPs/s 74.63, error 0.000e+000, efficiency/core 126.07%, efficiency 31.52%, mem 3.00 MB
n 576, 3.93 ms, GFLOPs 0.382, GFLOPs/s 97.24, error 0.000e+000, efficiency/core 164.26%, efficiency 41.07%, mem 3.80 MB
n 640, 5.21 ms, GFLOPs 0.524, GFLOPs/s 100.60, error 0.000e+000, efficiency/core 169.93%, efficiency 42.48%, mem 4.69 MB
n 704, 6.73 ms, GFLOPs 0.698, GFLOPs/s 103.63, error 0.000e+000, efficiency/core 175.04%, efficiency 43.76%, mem 5.67 MB
n 768, 8.55 ms, GFLOPs 0.906, GFLOPs/s 105.95, error 0.000e+000, efficiency/core 178.98%, efficiency 44.74%, mem 6.75 MB
n 832, 10.89 ms, GFLOPs 1.152, GFLOPs/s 105.76, error 0.000e+000, efficiency/core 178.65%, efficiency 44.66%, mem 7.92 MB
n 896, 13.26 ms, GFLOPs 1.439, GFLOPs/s 108.48, error 0.000e+000, efficiency/core 183.25%, efficiency 45.81%, mem 9.19 MB
n 960, 16.36 ms, GFLOPs 1.769, GFLOPs/s 108.16, error 0.000e+000, efficiency/core 182.70%, efficiency 45.67%, mem 10.55 MB
n 1024, 17.74 ms, GFLOPs 2.147, GFLOPs/s 121.05, error 0.000e+000, efficiency/core 204.47%, efficiency 51.12%, mem 12.00 MB
_mm256_add_ps(_mm256_mul_ps))
and somebody knows why. I suspect that the MSVC makes the load dependent so it can't be done out of order. But that's just a guess. Somebody knows this much better than me and that's why the question is on SO. – Z boson