Convert SSE matrix-vector multiplication code to AVX

Question

I'm trying to convert my SSE function to AVX. The function does matrix-vector multiplication; here's my working SSE code:

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    size_t i, j;
    for (i = 0; i < vector_dims; ++i)
    {
        __m128 acc = _mm_setzero_ps();
        for (j = 0; j < vector_dims; j += 4)
        {
            __m128 vec = _mm_load_ps(&v[j]);
            __m128 mat = _mm_load_ps(&m[j + vector_dims * i]);
            //acc = _mm_add_ps(acc, _mm_mul_ps(mat, vec));
            acc = _mm_fmadd_ps(mat, vec, acc);
        }
        acc = _mm_hadd_ps(acc, acc);
        acc = _mm_hadd_ps(acc, acc);
        _mm_store_ss(&result[i], acc);
    }
}

And here's what I've come up with for AVX:

void multiply_matrix_by_vector_AVX(float* m, float* v, float* result, unsigned const int vector_dims)
{
    size_t i, j;

    for (i = 0; i < vector_dims; ++i)
    {
        __m256 acc = _mm256_setzero_ps();
        for (j = 0; j < vector_dims; j += 8)
        {
            __m256 vec = _mm256_load_ps(&v[j]);
            __m256 mat = _mm256_load_ps(&m[j + vector_dims * i]);
            acc = _mm256_fmadd_ps(mat, vec, acc);
        }
        acc = _mm256_hadd_ps(acc, acc);
        acc = _mm256_hadd_ps(acc, acc);
        acc = _mm256_hadd_ps(acc, acc);
        acc = _mm256_hadd_ps(acc, acc);

        _mm256_store_ps(&result[i], acc);
    }
}

However, the AVX code crashes (Access violation reading location 0xFFFFFFFFFFFFFFFF).

Could anyone help me to make my AVX function work properly?

PS: The sizes of the matrices and vectors that I pass to my functions are always multiples of 8. Also, the arrays I pass to my SSE function are 16-byte aligned (__declspec(align(16))float* = generate_matrix(256);) and the arrays I pass to my AVX function are 32-byte aligned (__declspec(align(32))float* = generate_matrix(256);).

harold · Accepted Answer · 2015-11-21T17:08:21

Unfortunately, using horizontal adds like that does not trivially extend to 256 bits, because the instruction (like most other AVX instructions) is "laned": it acts like two haddps instructions in parallel, one on the upper 128-bit half and one on the lower half, with no mixing between them. The sums of the lower and upper halves therefore never get added together.
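
To see the lane behaviour in isolation, here is a minimal sketch. The helper name hadd_twice and the TARGET_AVX macro are made up for this illustration; the macro only exists so that GCC/Clang accept the AVX intrinsics without extra compiler flags, and it expands to nothing elsewhere.

```c
#include <immintrin.h>

/* Hypothetical convenience macro: enables AVX for one function on GCC/Clang.
   Other compilers (e.g. MSVC) do not need it. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Applies _mm256_hadd_ps twice to in[0..7] and writes the raw 8-float
   result to out[0..7], so the per-lane behaviour can be inspected.
   Unaligned loads/stores are used, so no alignment is assumed. */
TARGET_AVX
void hadd_twice(const float *in, float *out)
{
    __m256 v = _mm256_loadu_ps(in);
    v = _mm256_hadd_ps(v, v);   /* pairwise sums, separately in each 128-bit lane */
    v = _mm256_hadd_ps(v, v);   /* each lane now holds its own 4-element sum */
    _mm256_storeu_ps(out, v);
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8}, the lower four outputs hold 1+2+3+4 and the upper four hold 5+6+7+8; the full sum of all eight elements appears nowhere in the register.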

Also, the result is, of course, still not packed: each row produces a single float, yet the packed store there writes all 8 elements. It is also an aligned store to &result[i], which is not 32-byte aligned for most values of i, so it will fail (that error is a bit weird, but whatever).
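
As a small illustration of the alignment point (is_aligned is a hypothetical helper, not part of any intrinsics header): even if result itself is 32-byte aligned, &result[i] is only 32-byte aligned when i is a multiple of 8, because each float is 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address p is a multiple of `alignment` bytes, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

With a 32-byte-aligned result array, is_aligned(&result[0], 32) and is_aligned(&result[8], 32) hold, while the check fails for &result[1] through &result[7].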

Anyway, let's fix the horizontal sum (not tested):

// this part still works, but it only sums within each 128-bit lane
acc = _mm256_hadd_ps(acc, acc);
acc = _mm256_hadd_ps(acc, acc);
// this is new: add the lower-lane sum to the upper-lane sum
__m128 acc1 = _mm256_extractf128_ps(acc, 0);
__m128 acc2 = _mm256_extractf128_ps(acc, 1);
acc1 = _mm_add_ss(acc1, acc2);
// do a scalar store, obviously
_mm_store_ss(&result[i], acc1);
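
Put together with the loop from the question, the whole function might look roughly like the sketch below. The name multiply_matrix_by_vector_avx_fixed and the TARGET_AVX_FMA macro are made up here. It assumes a row-major square matrix whose dimension is a multiple of 8 (as stated in the question), and it uses unaligned loads as a deliberate simplification so that it does not depend on how the arrays were allocated.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical convenience macro: enables AVX and FMA for one function on
   GCC/Clang. Other compilers (e.g. MSVC) do not need it. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX_FMA __attribute__((target("avx,fma")))
#else
#define TARGET_AVX_FMA
#endif

/* result = m * v for a row-major n x n matrix m.
   Assumption: n is a multiple of 8. */
TARGET_AVX_FMA
void multiply_matrix_by_vector_avx_fixed(const float *m, const float *v,
                                         float *result, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t j = 0; j < n; j += 8) {
            __m256 vec = _mm256_loadu_ps(&v[j]);
            __m256 mat = _mm256_loadu_ps(&m[i * n + j]);
            acc = _mm256_fmadd_ps(mat, vec, acc);   /* acc += mat * vec */
        }
        /* Reduce within each 128-bit lane... */
        acc = _mm256_hadd_ps(acc, acc);
        acc = _mm256_hadd_ps(acc, acc);
        /* ...then add the two lane sums and store a single float. */
        __m128 lo = _mm256_extractf128_ps(acc, 0);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        _mm_store_ss(&result[i], _mm_add_ss(lo, hi));
    }
}
```

If the inputs are known to be 32-byte aligned, the loads can be switched back to _mm256_load_ps; the scalar store has no alignment requirement beyond that of a float.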

By the way, that inner loop needs 10 independent dependency chains (and therefore 10 accumulators) to maximize throughput on Haswell, where FMA has a latency of 5 cycles and a throughput of 2 per cycle.
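
A rough sketch of the multiple-accumulator idea follows. For brevity it uses 4 accumulators rather than the 10 mentioned above; the pattern extends the same way. The function name and the TARGET_AVX_FMA macro are hypothetical, n is assumed to be a multiple of 8, and unaligned loads are used as a simplification.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical convenience macro: enables AVX and FMA for one function on
   GCC/Clang. Other compilers (e.g. MSVC) do not need it. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX_FMA __attribute__((target("avx,fma")))
#else
#define TARGET_AVX_FMA
#endif

/* result = m * v for a row-major n x n matrix m, with n a multiple of 8.
   Four accumulators give four independent FMA dependency chains per row. */
TARGET_AVX_FMA
void multiply_matrix_by_vector_avx_4acc(const float *m, const float *v,
                                        float *result, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        const float *row = &m[i * n];
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();
        size_t j = 0;
        /* Main loop: 32 floats per iteration, one chain per accumulator. */
        for (; j + 32 <= n; j += 32) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(row + j),
                                   _mm256_loadu_ps(v + j), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(row + j + 8),
                                   _mm256_loadu_ps(v + j + 8), acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(row + j + 16),
                                   _mm256_loadu_ps(v + j + 16), acc2);
            acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(row + j + 24),
                                   _mm256_loadu_ps(v + j + 24), acc3);
        }
        /* Cleanup: remaining blocks of 8 when n is not a multiple of 32. */
        for (; j < n; j += 8) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(row + j),
                                   _mm256_loadu_ps(v + j), acc0);
        }
        /* Combine the chains, then do the lane-aware horizontal sum. */
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                   _mm256_add_ps(acc2, acc3));
        acc = _mm256_hadd_ps(acc, acc);
        acc = _mm256_hadd_ps(acc, acc);
        __m128 lo = _mm256_extractf128_ps(acc, 0);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        _mm_store_ss(&result[i], _mm_add_ss(lo, hi));
    }
}
```

Note that splitting the sum across accumulators changes the order of the floating-point additions, so results can differ from the single-accumulator version in the last bits.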
