8
votes

I was writing a matrix-vector multiplication in both SSE and AVX using the following:

for(size_t i=0;i<M;i++) {
    size_t index = i*N;
    __m128 a, x, r1;
    __m128 sum = _mm_setzero_ps();
    for(size_t j=0;j<N;j+=4,index+=4) {
        a = _mm_load_ps(&A[index]);   // four elements of row i
        x = _mm_load_ps(&X[j]);       // four elements of the vector
        r1 = _mm_mul_ps(a,x);
        sum = _mm_add_ps(r1,sum);     // accumulate the partial dot product
    }
    sum = _mm_hadd_ps(sum,sum);       // horizontally reduce the four partial sums
    sum = _mm_hadd_ps(sum,sum);
    _mm_store_ss(&C[i],sum);
}

I used a similar method for AVX; however, since AVX doesn't have an equivalent instruction to _mm_store_ss(), at the end I used:

_mm_store_ss(&C[i],_mm256_castps256_ps128(sum));

The SSE code gives me a speedup of 3.7 over the serial code. However, the AVX code gives me a speedup of only 4.3 over the serial code.

I know that mixing SSE and AVX code can cause problems, but I compiled with the -mavx flag in g++, which should remove the legacy SSE opcodes.

I could also have used _mm256_storeu_ps(&C[i],sum) to do the same thing, but the speedup is the same.

Any insights as to what else I could do to improve performance? Could it be related to performance_memory_bound? I didn't clearly understand the answer in that thread.

Also, I am not able to use the _mm_fmadd_ps() instruction, even after including the "immintrin.h" header file. I have both FMA and AVX enabled.

4
It could be that the CPU is just idling while waiting for memory I/O. That would mean it actually does its computations much faster, but is then stuck waiting longer for the next chunk of data. – Marc Claesen
_mm_store_ss(&C[i],_mm256_castps256_ps128(sum)); is the equivalent instruction in AVX. SSE instructions just operate on the lower 128 bits of the 256-bit AVX register. The cast is only there to make the compiler happy and does not generate an instruction. – Z boson
You should try unrolling your loop at least once. – Z boson
"I used a similar method for AVX" - just to be sure, I assume this similar method has all the 4s appropriately changed to 8s. Just in case. – Christian Rau
Well, I was doing matrix-matrix, not just matrix-vector, multiplication. I did several things: loop unrolling, loop tiling, AVX, OpenMP. It's actually quite difficult to get more than 50% of the peak flops. I eventually got up to about 70%, I think, which was still slower than MKL but faster than Eigen. – Z boson

4 Answers

5
votes

I suggest you reconsider your algorithm. See the discussion Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

You're doing one long dot product and using _mm_hadd_ps per iteration. Instead you should do four dot products at once with SSE (eight with AVX) and only use vertical operators.

You need addition, multiplication, and a broadcast. This can all be done in SSE with _mm_add_ps, _mm_mul_ps, and _mm_shuffle_ps (for the broadcast).

If you already have the transpose of the matrix this is really simple.
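For illustration, here is a rough sketch of the idea (my own sketch, not drop-in code): AT is assumed to be the transpose of A stored row-major, so AT[j*M + i] == A[i][j], M is assumed to be a multiple of 4, and I broadcast with _mm_set1_ps instead of _mm_shuffle_ps because x[j] comes from memory here rather than from a register:

#include <immintrin.h>
#include <stddef.h>

// Y = A*X computed column by column: broadcast X[j], multiply it with four
// consecutive rows of column j, and accumulate vertically - no hadd needed.
void matvec_vertical(const float *AT, const float *X, float *Y, size_t M, size_t N)
{
    for (size_t i = 0; i < M; i += 4) {
        __m128 acc = _mm_setzero_ps();               // partial results Y[i..i+3]
        for (size_t j = 0; j < N; j++) {
            __m128 xj  = _mm_set1_ps(X[j]);          // broadcast x[j] to all lanes
            __m128 col = _mm_loadu_ps(&AT[j*M + i]); // A[i..i+3][j]
            acc = _mm_add_ps(acc, _mm_mul_ps(col, xj));
        }
        _mm_storeu_ps(&Y[i], acc);                   // four dot products at once
    }
}

Each outer iteration now produces four results of C at once with only vertical adds and multiplies; with __m256 and the _mm256_* equivalents it produces eight.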

But whether you have the transpose or not, you need to make your code more cache friendly. To fix this, I suggest loop tiling of the matrix. See the discussion What is the fastest way to transpose a matrix in C++? to get an idea of how to do loop tiling.

I would try to get the loop tiling right before even trying SSE/AVX. The biggest boost I got in my matrix multiplication was not from SIMD or threading; it was from loop tiling. I think that if you get the cache usage right, your AVX code will also scale more linearly relative to SSE.
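To show the structure I mean, here is a minimal loop-tiling sketch for a square matrix-matrix product (my own illustration, not tuned code; BLOCK is a tuning parameter you would pick so that the tiles fit in cache):

#include <stddef.h>

// C += A*B for row-major n x n matrices, processed in BLOCK x BLOCK tiles so
// that the data touched by the inner loops stays resident in cache.
enum { BLOCK = 64 };

void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
    for (size_t kk = 0; kk < n; kk += BLOCK)
    for (size_t jj = 0; jj < n; jj += BLOCK)
        for (size_t i = ii; i < ii + BLOCK && i < n; i++)
        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
            const float a = A[i*n + k];
            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
}

The same idea carries over to the matrix-vector case: process the matrix in blocks so the vector and the partial sums stay in cache while you sweep over the matrix.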

3
votes

Consider this code. I'm not familiar with the Intel version, but this is faster than the XMMatrixMultiply found in DirectX. It's not about how much math is done per instruction; it's about reducing the instruction count (as long as you are using fast instructions, which this implementation does).

// Perform a 4x4 matrix multiply by a 4x4 matrix 
// Be sure to run in 64-bit mode and set the right flags:
// Properties, C/C++, Enable Enhanced Instruction Set, /arch:AVX
// Having MATRIX on a 32-byte boundary does help performance
struct MATRIX {
    union {
        float  f[4][4];
        __m128 m[4];
        __m256 n[2];
    };
};

MATRIX myMultiply(MATRIX M1, MATRIX M2) {
    MATRIX mResult;
    __m256 a0, a1, b0, b1;
    __m256 c0, c1, c2, c3, c4, c5, c6, c7;
    __m256 t0, t1, u0, u1;

    t0 = M1.n[0];                                                   // t0 = a00, a01, a02, a03, a10, a11, a12, a13
    t1 = M1.n[1];                                                   // t1 = a20, a21, a22, a23, a30, a31, a32, a33
    u0 = M2.n[0];                                                   // u0 = b00, b01, b02, b03, b10, b11, b12, b13
    u1 = M2.n[1];                                                   // u1 = b20, b21, b22, b23, b30, b31, b32, b33

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(0, 0, 0, 0));        // a0 = a00, a00, a00, a00, a10, a10, a10, a10
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 0, 0, 0));        // a1 = a20, a20, a20, a20, a30, a30, a30, a30
    b0 = _mm256_permute2f128_ps(u0, u0, 0x00);                      // b0 = b00, b01, b02, b03, b00, b01, b02, b03  
    c0 = _mm256_mul_ps(a0, b0);                                     // c0 = a00*b00  a00*b01  a00*b02  a00*b03  a10*b00  a10*b01  a10*b02  a10*b03
    c1 = _mm256_mul_ps(a1, b0);                                     // c1 = a20*b00  a20*b01  a20*b02  a20*b03  a30*b00  a30*b01  a30*b02  a30*b03

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(1, 1, 1, 1));        // a0 = a01, a01, a01, a01, a11, a11, a11, a11
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(1, 1, 1, 1));        // a1 = a21, a21, a21, a21, a31, a31, a31, a31
    b0 = _mm256_permute2f128_ps(u0, u0, 0x11);                      // b0 = b10, b11, b12, b13, b10, b11, b12, b13
    c2 = _mm256_mul_ps(a0, b0);                                     // c2 = a01*b10  a01*b11  a01*b12  a01*b13  a11*b10  a11*b11  a11*b12  a11*b13
    c3 = _mm256_mul_ps(a1, b0);                                     // c3 = a21*b10  a21*b11  a21*b12  a21*b13  a31*b10  a31*b11  a31*b12  a31*b13

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 2, 2, 2));        // a0 = a02, a02, a02, a02, a12, a12, a12, a12
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(2, 2, 2, 2));        // a1 = a22, a22, a22, a22, a32, a32, a32, a32
    b1 = _mm256_permute2f128_ps(u1, u1, 0x00);                      // b1 = b20, b21, b22, b23, b20, b21, b22, b23
    c4 = _mm256_mul_ps(a0, b1);                                     // c4 = a02*b20  a02*b21  a02*b22  a02*b23  a12*b20  a12*b21  a12*b22  a12*b23
    c5 = _mm256_mul_ps(a1, b1);                                     // c5 = a22*b20  a22*b21  a22*b22  a22*b23  a32*b20  a32*b21  a32*b22  a32*b23

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(3, 3, 3, 3));        // a0 = a03, a03, a03, a03, a13, a13, a13, a13
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(3, 3, 3, 3));        // a1 = a23, a23, a23, a23, a33, a33, a33, a33
    b1 = _mm256_permute2f128_ps(u1, u1, 0x11);                      // b1 = b30, b31, b32, b33, b30, b31, b32, b33
    c6 = _mm256_mul_ps(a0, b1);                                     // c6 = a03*b30  a03*b31  a03*b32  a03*b33  a13*b30  a13*b31  a13*b32  a13*b33
    c7 = _mm256_mul_ps(a1, b1);                                     // c7 = a23*b30  a23*b31  a23*b32  a23*b33  a33*b30  a33*b31  a33*b32  a33*b33

    c0 = _mm256_add_ps(c0, c2);                                     // c0 = c0 + c2 (two terms, first two rows)
    c4 = _mm256_add_ps(c4, c6);                                     // c4 = c4 + c6 (the other two terms, first two rows)
    c1 = _mm256_add_ps(c1, c3);                                     // c1 = c1 + c3 (two terms, second two rows)
    c5 = _mm256_add_ps(c5, c7);                                     // c5 = c5 + c7 (the other two terms, second two rows)

    // Finally complete addition of all four terms and return the results
    mResult.n[0] = _mm256_add_ps(c0, c4);       // n0 = a00*b00+a01*b10+a02*b20+a03*b30  a00*b01+a01*b11+a02*b21+a03*b31  a00*b02+a01*b12+a02*b22+a03*b32  a00*b03+a01*b13+a02*b23+a03*b33
                                                //      a10*b00+a11*b10+a12*b20+a13*b30  a10*b01+a11*b11+a12*b21+a13*b31  a10*b02+a11*b12+a12*b22+a13*b32  a10*b03+a11*b13+a12*b23+a13*b33
    mResult.n[1] = _mm256_add_ps(c1, c5);       // n1 = a20*b00+a21*b10+a22*b20+a23*b30  a20*b01+a21*b11+a22*b21+a23*b31  a20*b02+a21*b12+a22*b22+a23*b32  a20*b03+a21*b13+a22*b23+a23*b33
                                                //      a30*b00+a31*b10+a32*b20+a33*b30  a30*b01+a31*b11+a32*b21+a33*b31  a30*b02+a31*b12+a32*b22+a33*b32  a30*b03+a31*b13+a32*b23+a33*b33
    return mResult;
}
1
votes

As somebody already suggested, add -funroll-loops

Curiously this is not set by default.

Use __restrict on the float pointer arguments, and const for array references that are only read. I don't know if the compiler is smart enough to recognize that the three intermediate values inside the loop (a, x, r1) don't need to be kept alive from iteration to iteration; I would remove those three variables or at least declare them inside the loop. Likewise, index can be declared where j is declared to keep it local. Make sure M and N are declared const, and if their values are compile-time constants, let the compiler see them. A sketch that puts these suggestions together follows.
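Combined with the unrolling suggested above, the loop from the question might end up looking roughly like this. This is only a sketch: it assumes N is a multiple of 8, and the two accumulators reorder the floating-point additions, which the compiler will not normally do on its own:

#include <immintrin.h>
#include <stddef.h>

void matvec(const float * __restrict A, const float * __restrict X,
            float * __restrict C, const size_t M, const size_t N)
{
    for (size_t i = 0; i < M; i++) {
        // two independent accumulators break the add dependency chain
        __m128 sum0 = _mm_setzero_ps();
        __m128 sum1 = _mm_setzero_ps();
        // index and the temporaries now live only inside the loops that use them
        for (size_t j = 0, index = i*N; j < N; j += 8, index += 8) {
            sum0 = _mm_add_ps(_mm_mul_ps(_mm_load_ps(&A[index    ]), _mm_load_ps(&X[j    ])), sum0);
            sum1 = _mm_add_ps(_mm_mul_ps(_mm_load_ps(&A[index + 4]), _mm_load_ps(&X[j + 4])), sum1);
        }
        __m128 sum = _mm_add_ps(sum0, sum1);
        sum = _mm_hadd_ps(sum, sum);
        sum = _mm_hadd_ps(sum, sum);
        _mm_store_ss(&C[i], sum);
    }
}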

-1
votes

Once again: if you want to build your own matrix multiplication algorithm, please stop. I remember that on Intel's AVX forum one of their engineers admitted it took them a very long time to write AVX assembly that reached the theoretical AVX throughput for multiplying two matrices (especially small matrices), because AVX load and store instructions are quite slow at the moment, not to mention the difficulty of overcoming the threading overhead for the parallel version.

Please install the Intel Math Kernel Library, spend half an hour reading the manual, and write one line of code to call the library. Done!
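For the matrix-vector product in the question, that one call is cblas_sgemv. A minimal sketch (the exact include path and link flags depend on how MKL is installed on your system):

#include <mkl.h>

// y = 1.0f * A * x + 0.0f * y for a row-major M x N matrix A
void matvec_mkl(const float *A, const float *x, float *y, int M, int N)
{
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N,
                1.0f, A, N,      // matrix and its leading dimension
                x, 1,            // input vector, unit stride
                0.0f, y, 1);     // output vector, unit stride
}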