8
votes

I was writing a matrix-vector multiplication in both SSE and AVX using the following:

for(size_t i=0;i<M;i++) {
    size_t index = i*N;
    __m128 a, x, r1;
    __m128 sum = _mm_setzero_ps();
    for(size_t j=0;j<N;j+=4,index+=4) {
        a = _mm_load_ps(&A[index]);   // four elements of row i
        x = _mm_load_ps(&X[j]);       // four elements of the vector
        r1 = _mm_mul_ps(a,x);
        sum = _mm_add_ps(r1,sum);     // accumulate the partial dot product
    }
    sum = _mm_hadd_ps(sum,sum);       // horizontally reduce the four partial sums
    sum = _mm_hadd_ps(sum,sum);
    _mm_store_ss(&C[i],sum);
}

I used a similar method for AVX; however, since AVX doesn't have an equivalent instruction to _mm_store_ss(), at the end I used:

_mm_store_ss(&C[i],_mm256_castps256_ps128(sum));

The SSE code gives me a speedup of 3.7 over the serial code. However, the AVX code gives me a speedup of only 4.3 over the serial code.

I know that mixing SSE and AVX code can cause problems, but I compiled with the -mavx flag in g++, which should remove the legacy SSE opcodes.

I could also have used _mm256_storeu_ps(&C[i],sum) to do the same thing, but the speedup is the same.

Any insights as to what else I could do to improve performance? Could it be related to performance_memory_bound? I didn't clearly understand the answer in that thread.

Also, I am not able to use the _mm_fmadd_ps() instruction, even after including the "immintrin.h" header file. I have both FMA and AVX enabled.

4
It could be that the CPU is just idling while waiting for memory I/O. That would mean it actually does its computations much faster, but is then stuck waiting longer for the next chunk of data. – Marc Claesen
_mm_store_ss(&C[i],_mm256_castps256_ps128(sum)); is the equivalent instruction in AVX. SSE instructions just operate on the lower 128 bits of the 256-bit AVX register. The cast is only there to make the compiler happy and does not generate an instruction. – Z boson
You should try unrolling your loop at least once. – Z boson
"I used a similar method for AVX" - just to be sure, I assume this similar method has all the 4s appropriately changed to 8s. Just in case. – Christian Rau
Well, I was doing matrix-matrix, not just matrix-vector, multiplication. I did several things: loop unrolling, loop tiling, AVX, OpenMP. It's actually quite difficult to get more than 50% of the peak flops. I eventually got up to about 70%, I think, which was still slower than MKL but faster than Eigen. – Z boson

4 Answers

5
votes

I suggest you reconsider your algorithm. See the discussion Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

You're doing one long dot product and using _mm_hadd_ps per iteration. Instead you should do four dot products at once with SSE (eight with AVX) and only use vertical operators.

You need addition, multiplication, and a broadcast. This can all be done in SSE with _mm_add_ps, _mm_mul_ps, and _mm_shuffle_ps (for the broadcast).

If you already have the transpose of the matrix this is really simple.
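For illustration, here is a rough sketch of the idea (my own sketch, not drop-in code): AT is assumed to be the transpose of A stored row-major, so AT[j*M + i] == A[i][j], M is assumed to be a multiple of 4, and I broadcast with _mm_set1_ps instead of _mm_shuffle_ps because x[j] comes from memory here rather than from a register:

#include <immintrin.h>
#include <stddef.h>

// Y = A*X computed column by column: broadcast X[j], multiply it with four
// consecutive rows of column j, and accumulate vertically - no hadd needed.
void matvec_vertical(const float *AT, const float *X, float *Y, size_t M, size_t N)
{
    for (size_t i = 0; i < M; i += 4) {
        __m128 acc = _mm_setzero_ps();               // partial results Y[i..i+3]
        for (size_t j = 0; j < N; j++) {
            __m128 xj  = _mm_set1_ps(X[j]);          // broadcast x[j] to all lanes
            __m128 col = _mm_loadu_ps(&AT[j*M + i]); // A[i..i+3][j]
            acc = _mm_add_ps(acc, _mm_mul_ps(col, xj));
        }
        _mm_storeu_ps(&Y[i], acc);                   // four dot products at once
    }
}

Each outer iteration now produces four results of C at once with only vertical adds and multiplies; with __m256 and the _mm256_* equivalents it produces eight.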

But whether you have the transpose or not, you need to make your code more cache friendly. To fix this, I suggest loop tiling of the matrix. See the discussion What is the fastest way to transpose a matrix in C++? to get an idea of how to do loop tiling.

I would try to get the loop tiling right before even trying SSE/AVX. The biggest boost I got in my matrix multiplication was not from SIMD or threading; it was from loop tiling. I think that if you get the cache usage right, your AVX code will also scale more linearly relative to SSE.
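To show the structure I mean, here is a minimal loop-tiling sketch for a square matrix-matrix product (my own illustration, not tuned code; BLOCK is a tuning parameter you would pick so that the tiles fit in cache):

#include <stddef.h>

// C += A*B for row-major n x n matrices, processed in BLOCK x BLOCK tiles so
// that the data touched by the inner loops stays resident in cache.
enum { BLOCK = 64 };

void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
    for (size_t kk = 0; kk < n; kk += BLOCK)
    for (size_t jj = 0; jj < n; jj += BLOCK)
        for (size_t i = ii; i < ii + BLOCK && i < n; i++)
        for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
            const float a = A[i*n + k];
            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
}

The same idea carries over to the matrix-vector case: process the matrix in blocks so the vector and the partial sums stay in cache while you sweep over the matrix.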

3
votes

Consider this code. I'm not familiar with the Intel version, but this is faster than the XMMatrixMultiply found in DirectX. It's not about how much math is done per instruction; it's about reducing the instruction count (as long as you are using fast instructions, which this implementation does).

// Perform a 4x4 matrix multiply by a 4x4 matrix 
// Be sure to run in 64-bit mode and set the right flags:
// Properties, C/C++, Enable Enhanced Instruction Set, /arch:AVX
// Having MATRIX on a 32-byte boundary does help performance
struct MATRIX {
    union {
        float  f[4][4];
        __m128 m[4];
        __m256 n[2];
    };
};

MATRIX myMultiply(MATRIX M1, MATRIX M2) {
    MATRIX mResult;
    __m256 a0, a1, b0, b1;
    __m256 c0, c1, c2, c3, c4, c5, c6, c7;
    __m256 t0, t1, u0, u1;

    t0 = M1.n[0];                                                   // t0 = a00, a01, a02, a03, a10, a11, a12, a13
    t1 = M1.n[1];                                                   // t1 = a20, a21, a22, a23, a30, a31, a32, a33
    u0 = M2.n[0];                                                   // u0 = b00, b01, b02, b03, b10, b11, b12, b13
    u1 = M2.n[1];                                                   // u1 = b20, b21, b22, b23, b30, b31, b32, b33

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(0, 0, 0, 0));        // a0 = a00, a00, a00, a00, a10, a10, a10, a10
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 0, 0, 0));        // a1 = a20, a20, a20, a20, a30, a30, a30, a30
    b0 = _mm256_permute2f128_ps(u0, u0, 0x00);                      // b0 = b00, b01, b02, b03, b00, b01, b02, b03  
    c0 = _mm256_mul_ps(a0, b0);                                     // c0 = a00*b00  a00*b01  a00*b02  a00*b03  a10*b00  a10*b01  a10*b02  a10*b03
    c1 = _mm256_mul_ps(a1, b0);                                     // c1 = a20*b00  a20*b01  a20*b02  a20*b03  a30*b00  a30*b01  a30*b02  a30*b03

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(1, 1, 1, 1));        // a0 = a01, a01, a01, a01, a11, a11, a11, a11
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(1, 1, 1, 1));        // a1 = a21, a21, a21, a21, a31, a31, a31, a31
    b0 = _mm256_permute2f128_ps(u0, u0, 0x11);                      // b0 = b10, b11, b12, b13, b10, b11, b12, b13
    c2 = _mm256_mul_ps(a0, b0);                                     // c2 = a01*b10  a01*b11  a01*b12  a01*b13  a11*b10  a11*b11  a11*b12  a11*b13
    c3 = _mm256_mul_ps(a1, b0);                                     // c3 = a21*b10  a21*b11  a21*b12  a21*b13  a31*b10  a31*b11  a31*b12  a31*b13

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 2, 2, 2));        // a0 = a02, a02, a02, a02, a12, a12, a12, a12
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(2, 2, 2, 2));        // a1 = a22, a22, a22, a22, a32, a32, a32, a32
    b1 = _mm256_permute2f128_ps(u1, u1, 0x00);                      // b1 = b20, b21, b22, b23, b20, b21, b22, b23
    c4 = _mm256_mul_ps(a0, b1);                                     // c4 = a02*b20  a02*b21  a02*b22  a02*b23  a12*b20  a12*b21  a12*b22  a12*b23
    c5 = _mm256_mul_ps(a1, b1);                                     // c5 = a22*b20  a22*b21  a22*b22  a22*b23  a32*b20  a32*b21  a32*b22  a32*b23

    a0 = _mm256_shuffle_ps(t0, t0, _MM_SHUFFLE(3, 3, 3, 3));        // a0 = a03, a03, a03, a03, a13, a13, a13, a13
    a1 = _mm256_shuffle_ps(t1, t1, _MM_SHUFFLE(3, 3, 3, 3));        // a1 = a23, a23, a23, a23, a33, a33, a33, a33
    b1 = _mm256_permute2f128_ps(u1, u1, 0x11);                      // b1 = b30, b31, b32, b33, b30, b31, b32, b33
    c6 = _mm256_mul_ps(a0, b1);                                     // c6 = a03*b30  a03*b31  a03*b32  a03*b33  a13*b30  a13*b31  a13*b32  a13*b33
    c7 = _mm256_mul_ps(a1, b1);                                     // c7 = a23*b30  a23*b31  a23*b32  a23*b33  a33*b30  a33*b31  a33*b32  a33*b33

    c0 = _mm256_add_ps(c0, c2);                                     // c0 = c0 + c2 (two terms, first two rows)
    c4 = _mm256_add_ps(c4, c6);                                     // c4 = c4 + c6 (the other two terms, first two rows)
    c1 = _mm256_add_ps(c1, c3);                                     // c1 = c1 + c3 (two terms, second two rows)
    c5 = _mm256_add_ps(c5, c7);                                     // c5 = c5 + c7 (the other two terms, second two rows)

    // Finally complete addition of all four terms and return the results
    mResult.n[0] = _mm256_add_ps(c0, c4);       // n0 = a00*b00+a01*b10+a02*b20+a03*b30  a00*b01+a01*b11+a02*b21+a03*b31  a00*b02+a01*b12+a02*b22+a03*b32  a00*b03+a01*b13+a02*b23+a03*b33
                                                //      a10*b00+a11*b10+a12*b20+a13*b30  a10*b01+a11*b11+a12*b21+a13*b31  a10*b02+a11*b12+a12*b22+a13*b32  a10*b03+a11*b13+a12*b23+a13*b33
    mResult.n[1] = _mm256_add_ps(c1, c5);       // n1 = a20*b00+a21*b10+a22*b20+a23*b30  a20*b01+a21*b11+a22*b21+a23*b31  a20*b02+a21*b12+a22*b22+a23*b32  a20*b03+a21*b13+a22*b23+a23*b33
                                                //      a30*b00+a31*b10+a32*b20+a33*b30  a30*b01+a31*b11+a32*b21+a33*b31  a30*b02+a31*b12+a32*b22+a33*b32  a30*b03+a31*b13+a32*b23+a33*b33
    return mResult;
}
1
votes

As somebody already suggested, add -funroll-loops

Curiously this is not set by default.

Use __restrict on the float pointer arguments, and const for array references that are only read. I don't know if the compiler is smart enough to recognize that the three intermediate values inside the loop (a, x, r1) don't need to be kept alive from iteration to iteration; I would remove those three variables or at least declare them inside the loop. Likewise, index can be declared where j is declared to keep it local. Make sure M and N are declared const, and if their values are compile-time constants, let the compiler see them. A sketch that puts these suggestions together follows.
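Combined with the unrolling suggested above, the loop from the question might end up looking roughly like this. This is only a sketch: it assumes N is a multiple of 8, and the two accumulators reorder the floating-point additions, which the compiler will not normally do on its own:

#include <immintrin.h>
#include <stddef.h>

void matvec(const float * __restrict A, const float * __restrict X,
            float * __restrict C, const size_t M, const size_t N)
{
    for (size_t i = 0; i < M; i++) {
        // two independent accumulators break the add dependency chain
        __m128 sum0 = _mm_setzero_ps();
        __m128 sum1 = _mm_setzero_ps();
        // index and the temporaries now live only inside the loops that use them
        for (size_t j = 0, index = i*N; j < N; j += 8, index += 8) {
            sum0 = _mm_add_ps(_mm_mul_ps(_mm_load_ps(&A[index    ]), _mm_load_ps(&X[j    ])), sum0);
            sum1 = _mm_add_ps(_mm_mul_ps(_mm_load_ps(&A[index + 4]), _mm_load_ps(&X[j + 4])), sum1);
        }
        __m128 sum = _mm_add_ps(sum0, sum1);
        sum = _mm_hadd_ps(sum, sum);
        sum = _mm_hadd_ps(sum, sum);
        _mm_store_ss(&C[i], sum);
    }
}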

-1
votes

Once again: if you want to build your own matrix multiplication algorithm, please stop. I remember that on Intel's AVX forum one of their engineers admitted it took them a very long time to write AVX assembly that reached the theoretical AVX throughput for multiplying two matrices (especially small matrices), because AVX load and store instructions are quite slow at the moment, not to mention the difficulty of overcoming the threading overhead for the parallel version.

Please install the Intel Math Kernel Library, spend half an hour reading the manual, and write one line of code to call the library. Done!
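For the matrix-vector product in the question, that one call is cblas_sgemv. A minimal sketch (the exact include path and link flags depend on how MKL is installed on your system):

#include <mkl.h>

// y = 1.0f * A * x + 0.0f * y for a row-major M x N matrix A
void matvec_mkl(const float *A, const float *x, float *y, int M, int N)
{
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N,
                1.0f, A, N,      // matrix and its leading dimension
                x, 1,            // input vector, unit stride
                0.0f, y, 1);     // output vector, unit stride
}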