3 votes

I have a simple vector-vector addition algorithm (c = a + b * lambda) written in x86-64 assembly, using AVX instructions. Here is my code:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Dense to dense
;; Uses cache
;; AVX
;; Without tolerances
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

global _denseToDenseAddAVX_cache_64_linux
_denseToDenseAddAVX_cache_64_linux:

push    rbp
mov     rbp, rsp

; rdi: address1
; rsi: address2
; rdx: address3
; rcx: count
; xmm0: lambda

mov     rax, rcx
shr     rcx, 3
and     rax, 0x07

vzeroupper

vmovupd  ymm5, [abs_mask]

sub     rsp, 8
vmovlpd  [rbp - 8], xmm0
vbroadcastsd    ymm7, [rbp - 8]
vmovapd     ymm6, ymm7

cmp     rcx, 0
je      after_loop_denseToDenseAddAVX_cache_64_linux

start_denseToDenseAddAVX_cache_64_linux:

vmovapd  ymm0, [rdi] ; a
vmovapd  ymm1, ymm7
vmulpd   ymm1, [rsi] ; b
vaddpd   ymm0, ymm1  ; ymm0 = c = a + b * lambda
vmovapd  [rdx], ymm0

vmovapd  ymm2, [rdi + 32] ; a
vmovapd  ymm3, ymm6
vmulpd   ymm3, [rsi + 32] ; b
vaddpd   ymm2, ymm3  ; ymm2 = c = a + b * lambda
vmovapd  [rdx + 32], ymm2

add     rdi, 64
add     rsi, 64
add     rdx, 64

loop    start_denseToDenseAddAVX_cache_64_linux

after_loop_denseToDenseAddAVX_cache_64_linux:

cmp     rax, 0
je      end_denseToDenseAddAVX_cache_64_linux

mov     rcx, rax

last_loop_denseToDenseAddAVX_cache_64_linux:

vmovlpd  xmm0, [rdi] ; a
vmovapd  xmm1, xmm7
vmulsd   xmm1, [rsi] ; b
vaddsd   xmm0, xmm1  ; xmm0 = c = a + b * lambda
vmovlpd  [rdx], xmm0

add     rdi, 8
add     rsi, 8
add     rdx, 8

loop    last_loop_denseToDenseAddAVX_cache_64_linux

end_denseToDenseAddAVX_cache_64_linux:

mov     rsp, rbp
pop     rbp
ret

People often suggest that I use Intel intrinsics because they are better and safer. So I have now implemented the algorithm like this:

#include <immintrin.h>

void denseToDenseAddAVX_cache(const double * __restrict__ a,
                              const double * __restrict__ b, 
                              double * __restrict__ c, 
                              size_t count, double lambda) {
    const size_t firstCount = count / 8;
    const size_t rem1 = count % 8;
    int i;
    __m256d mul = _mm256_broadcast_sd(&lambda);
    for (i = 0; i < firstCount; i++) {
        // c = a + b * lambda
        __m256d dataA1 = _mm256_load_pd(&a[i * 8]);
        __m256d dataC1 = _mm256_add_pd(dataA1, _mm256_mul_pd(_mm256_load_pd(&b[i * 8]), mul  ));
        _mm256_store_pd(&c[i * 8], dataC1);

        __m256d dataA2 = _mm256_load_pd(&a[i * 8 + 4]);
        __m256d dataC2 = _mm256_add_pd(dataA2, _mm256_mul_pd(_mm256_load_pd(&b[i * 8 + 4]), mul  ));
        _mm256_store_pd(&c[i * 8 + 4], dataC2);
    }
    const size_t secondCount = rem1 / 4;
    const size_t rem2 = rem1 % 4;
    if (secondCount) {
        __m256d dataA = _mm256_load_pd(&a[i * 8]);
        __m256d dataC = _mm256_add_pd(dataA, _mm256_mul_pd(_mm256_load_pd(&b[i * 8]), mul  ));
        _mm256_store_pd(&c[i * 8], dataC);
        i += 4;
    }
    for (; i < count; i++) {
        c[i] = a[i] + b[i] * lambda;
    }
}

My problem is that the assembly version is twice as fast as the intrinsics version. What is wrong with the C++ version?

As with any performance-related question: are you optimizing (-O3) your C++ code? - 1201ProgramAlarm
Which compiler, what options, and what hardware are you testing on? I assume with the same caller? If both callers are in the same test program, did you do an untimed warmup run first to get page faults out of the way, and get the CPU up to max turbo? - Peter Cordes
If your compiler can't beat this asm, you probably forgot to enable optimization or are testing it wrong. vmovapd ymm1, ymm7 is not needed, use 3-operand AVX instructions like vmulpd ymm1, ymm7, [rsi]. Plus you use the slow-on-Intel loop instruction, bottlenecking this loop at 1 iteration (2 vectors) per 7 clock cycles. agner.org/optimize. I think even if a compiler didn't unroll, and used indexed addressing modes defeating micro-fusion and the port 7 store-AGU on Intel, it would still be at least as good as this. Like GCC does: godbolt.org/z/MHgtfa - Peter Cordes
BTW, you can handle uneven counts by doing the last up-to-8 elements with unaligned vector loads that potentially overlap data you already loaded (ending at the end of the array). At least if your input size is known to be >= 4 elements. - Peter Cordes
Also, don't use vmovlpd as a load unless you want to merge into the low element of an existing vector. You want vmovsd to avoid a false dependency and an extra ALU uop. - Peter Cordes
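
(For illustration, here is a minimal sketch of the overlapping-tail trick from the comment above, written with intrinsics. The function name is hypothetical, and it assumes count >= 4 and that c does not alias a or b; the tail simply recomputes the last full vector, so the overlapping store writes the same values twice.)

#include <immintrin.h>
#include <stddef.h>

void denseToDenseAddAVX_overlapTail(const double *a, const double *b,
                                    double *c, size_t count, double lambda) {
    const __m256d mul = _mm256_set1_pd(lambda);
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        // c = a + b * lambda, one 4-double vector at a time
        __m256d v = _mm256_add_pd(_mm256_loadu_pd(a + i),
                                  _mm256_mul_pd(_mm256_loadu_pd(b + i), mul));
        _mm256_storeu_pd(c + i, v);
    }
    if (i < count) {
        // 1..3 elements left: redo the last full vector, unaligned,
        // overlapping up to 3 elements the main loop already stored.
        const size_t j = count - 4;
        __m256d v = _mm256_add_pd(_mm256_loadu_pd(a + j),
                                  _mm256_mul_pd(_mm256_loadu_pd(b + j), mul));
        _mm256_storeu_pd(c + j, v);
    }
}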

1 Answer

0 votes

A few things.

  1. I think this is the most important one. The assembly code uses pointer arithmetic; your C++ code doesn't: it computes an index first, then takes an address. Compilers often turn indexing into pointer math, but that is not reliable, so you'd better use the same pointer arithmetic in your C++. What's even worse, an expression like &a[i * 8 + 4] takes multiple integer instructions to evaluate. The byte offset is a + i*64 + 32, and x86 addressing modes can only scale an index for free by factors of 1, 2, 4 or 8, so the compiler has to emit a left shift followed by an add just to compute the address. This issue alone doubles the instruction count in the body of the loop.

  2. The C++ code uses a signed 32-bit integer for the loop counter, while the assembly code uses unsigned 64-bit integers. For performance-critical code it's often a good idea to use size_t for loop counters in C++. BTW, if you had enabled the "warnings as errors" setting in your C++ compiler, it would have refused to compile, saying something like "signed/unsigned mismatch".

  3. You have redundant loads in your C++. CPUs can do math plus one load in a single instruction. To do the same as the assembly does, don't use _mm256_load_pd; cast the pointers from const double * to const __m256d * and dereference them, as in the example below.

Here’s a slightly simplified example:

#include <immintrin.h>
#include <assert.h>
#include <stddef.h>

void denseToDenseAddAVX( const double *a, const double *b, double *c, size_t count, double lambda )
{
    assert( 0 == (size_t)( a ) % 32 );
    assert( 0 == (size_t)( b ) % 32 );
    assert( 0 == (size_t)( c ) % 32 );

    const double* const aEnd = a + count;
    const double* const aEndAligned = a + ( ( count / 4 ) * 4 );
    const __m256d mul = _mm256_set1_pd( lambda );
    while( a < aEndAligned )
    {
        const __m256d* const av = ( const __m256d* )a;
        const __m256d* const bv = ( const __m256d* )b;
        const __m256d cv = _mm256_add_pd( *av, _mm256_mul_pd( *bv, mul ) );
        _mm256_store_pd( c, cv );
        a += 4;
        b += 4;
        c += 4;
    }
    while( a < aEnd )
    {
        *c = ( *a ) + ( *b ) * lambda;
        a++;
        b++;
        c++;
    }
}
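
(For completeness, a hypothetical caller that satisfies the alignment asserts. The use of C11 aligned_alloc is my assumption; _mm_malloc or posix_memalign would work too. Note that aligned_alloc requires the allocation size to be a multiple of the alignment, hence the rounding.)

#include <stdlib.h>

int main()
{
    const size_t n = 1000003; // deliberately not a multiple of 4
    const size_t bytes = ( ( n * sizeof( double ) + 31 ) / 32 ) * 32;
    double *a = (double *)aligned_alloc( 32, bytes );
    double *b = (double *)aligned_alloc( 32, bytes );
    double *c = (double *)aligned_alloc( 32, bytes );
    for( size_t i = 0; i < n; i++ ) { a[ i ] = 1.0; b[ i ] = 2.0; }
    denseToDenseAddAVX( a, b, c, n, 0.5 ); // every c[ i ] should be 2.0
    free( a ); free( b ); free( c );
    return 0;
}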