
I made a matrix-vector multiplication program in C using AVX2 and FMA intrinsics. I compiled it with GCC 7 using -mfma and -mavx.

However, I got the error "incorrect checksum for freed object - object was probably modified after being freed."

I think the error occurs when the matrix dimension isn't a multiple of 4.

I know AVX2 uses ymm registers, each of which holds 4 double-precision floating-point numbers. So I can use AVX2 without errors when the matrix dimension is a multiple of 4.

But here is my question: how can I use AVX2 efficiently if the matrix dimension isn't a multiple of 4?

Here is my code.

#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "time.h"
#include "x86intrin.h"

void mv(double *a,double *b,double *c, int m, int n, int l)
{
    __m256d va,vb,vc;
    int k;
    int i;
    for (k = 0; k < l; k++) {
        vb = _mm256_broadcast_sd(&b[k]);
        for (i = 0; i < m; i+=4) {
            va = _mm256_loadu_pd(&a[m*k+i]);
            vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);
            _mm256_storeu_pd( &c[i], vc );
        }
    }
}
int main(int argc, char* argv[]) {

    // set variables
    int m;
    double* a;
    double* b;
    double* c;
    int i;
    int temp=0;
    struct timespec startTime, endTime;

    m=9;
    // main program

    // set vector or matrix
    a=(double *)malloc(sizeof(double) * m*m);
    b=(double *)malloc(sizeof(double) * m*1);
    c=(double *)malloc(sizeof(double) * m*1);

    for (i=0;i<m;i++) {
        a[i]=1;
        b[i]=1;
        c[i]=0.0;
    }
    for (i=m;i<m*m;i++) {
        a[i]=1;
    }

    // check start time
    clock_gettime(CLOCK_REALTIME, &startTime);
    mv(a, b, c, m, 1, m);
    // check end time
    clock_gettime(CLOCK_REALTIME, &endTime);

    free(a);
    free(b);
    free(c);
    return 0;
}

1 Answer


You load and store vectors of 4 doubles, but your loop condition only checks that the first vector element is in-bounds, so you can write past the end of an object by up to 3x8 = 24 bytes when m is not a multiple of 4.

You need something like i < (m-3) in the main loop, plus a cleanup strategy for handling the last partial vector of data. Vectorizing with SIMD is very much like unrolling: you have to check in the loop condition that it's ok to process multiple future elements.

A scalar cleanup loop works well, but we can do better. For example, do as many 128-bit vectors as possible after the last full 256-bit vector (i.e. up to 1), before going scalar.

In many cases (e.g. a write-only destination) a final unaligned vector that ends at the last element of your arrays works very well (when m>=4). It can overlap the main loop's last full vector if m%4 != 0, but that's fine because your output array doesn't overlap your inputs, so redoing an element as part of a single cleanup is cheaper than branching to avoid it.

But that doesn't work here, because your logic is c[i+0..3] += ..., so redoing an element would double-count it.
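
For contrast, here's a minimal sketch (untested) of that overlapping-final-vector trick in a case where it does apply: a write-only destination, c[i] = a[i] * s. The function name and signature are illustrative, not from your code.

#include <immintrin.h>

void scale(const double *a, double *c, double s, int m)  // assumes m >= 4
{
    __m256d vs = _mm256_set1_pd(s);
    int i;
    for (i = 0; i < m - 3; i += 4)
        _mm256_storeu_pd(&c[i], _mm256_mul_pd(_mm256_loadu_pd(&a[i]), vs));
    if (i != m) {
        // final unaligned vector ends exactly at c[m-1]; it may redo up to 3
        // elements the loop already wrote, which is harmless for a pure store
        _mm256_storeu_pd(&c[m-4], _mm256_mul_pd(_mm256_loadu_pd(&a[m-4]), vs));
    }
}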

// cleanup using a 128-bit FMA, then scalar if there's an odd element.
// untested

void mv(double *a,double *b,double *c, int m, int n, int l)
{
   /*  the loop below should actually work for m=1..3, but a separate strategy might be good.
    if (m < 4) {
        // maybe check m >= 2 and use __m128 vectors?
        // or vectorize differently?
    }
   */


    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i;
        for (i = 0; i < (m-3); i+=4) {
            __m256d va = _mm256_loadu_pd(&a[m*k+i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
                    vc = _mm256_fmadd_pd(va, vb, vc);
            _mm256_storeu_pd( &c[i], vc );
        }
        if (i<(m-1)) {
            __m128d lasta = _mm_loadu_pd(&a[m*k+i]);
            __m128d lastc = _mm_loadu_pd(&c[i]);
                    lastc = _mm_fmadd_pd(lasta, _mm256_castpd256_pd128(vb), lastc);
            _mm_storeu_pd( &c[i], lastc );
            // i+=2;  // last element only checks m odd/even, doesn't use i
        }
        // if (i<m)
        if (m&1) {
            // odd number of elements, do the last non-vector one
            c[m-1] += a[m*k + m-1] * _mm256_cvtsd_f64(vb);
        }

    }
}

I haven't looked at exactly how gcc/clang -O3 compile that. Sometimes compilers try to get too smart with cleanup code (e.g. trying to auto-vectorize scalar cleanup loops).

Other strategies could include doing the last up-to-4 elements with an AVX masked store: you need the same mask for the end of every matrix row, so generating it once and then using it at the end of every row could be good. See "Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all". (To simplify branching, you'd set it up so your main loop only goes to i < (m-4), then you always run the cleanup. In the m%4 == 0 case, the mask is all-ones so you do the final full vector.) If you can't safely read past the end of the matrix, you probably need a masked load as well as the masked store.
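
Here's a rough sketch (untested) of that masked version; mv_masked and tail_mask_pd are made-up names, and it needs AVX2 for _mm256_cmpgt_epi64:

#include <immintrin.h>

// helper: lane i of the result is all-ones when i < r, so r=4 gives a full mask
static inline __m256i tail_mask_pd(int r) {
    __m256i idx = _mm256_setr_epi64x(0, 1, 2, 3);
    return _mm256_cmpgt_epi64(_mm256_set1_epi64x(r), idx);
}

void mv_masked(double *a, double *b, double *c, int m, int l)
{
    int tail = (m - 1) & ~3;               // start of the last (possibly partial) vector
    __m256i mask = tail_mask_pd(m - tail); // 1..4 active lanes; same mask for every row
    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i;
        for (i = 0; i < tail; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[m*k + i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            _mm256_storeu_pd(&c[i], _mm256_fmadd_pd(va, vb, vc));
        }
        // masked load + masked store: never touches memory past the end of a row
        __m256d va = _mm256_maskload_pd(&a[m*k + i], mask);
        __m256d vc = _mm256_maskload_pd(&c[i], mask);
        _mm256_maskstore_pd(&c[i], mask, _mm256_fmadd_pd(va, vb, vc));
    }
}

Note the main loop stops one vector early even when m%4 == 0, so the always-run masked cleanup handles the final full vector with an all-ones mask, matching the branch-free setup described above.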


You could also look at aligning your rows for efficiency, or using a row stride that's separate from the logical row length (i.e. pad rows out to 32-byte boundaries). Leaving padding at the end of rows simplifies the cleanup, because you can always do whole vectors that write into padding.
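
For example, a sketch of a padded allocator (illustrative name; assumes C11 aligned_alloc):

#include <stdlib.h>
#include <string.h>

// allocate l columns, rounding the length-m contiguous dimension up to a
// multiple of 4 doubles so every column can be processed in whole vectors
double *alloc_padded(int m, int l, int *stride_out)
{
    int stride = (m + 3) & ~3;                 // 32-byte stride
    size_t bytes = sizeof(double) * (size_t)stride * l;
    double *p = aligned_alloc(32, bytes);      // bytes is a multiple of 32
    if (p) memset(p, 0, bytes);                // zeroed padding keeps NaNs/denormals out of the FMAs
    *stride_out = stride;
    return p;
}

Then index as a[stride*k + i] and let the inner loop run i up to stride in whole vectors; the extra lanes only ever read and write padding.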


Special case m==2: instead of broadcasting one element from b[], you'd like to broadcast 2 elements into two 128-bit lanes of a __m256d, so one 256-bit FMA could do 2 rows at once.
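
A sketch of that (untested; mv2 is an illustrative name, and _mm256_permute4x64_pd needs AVX2):

#include <immintrin.h>

void mv2(double *a, double *b, double *c, int l)   // the m == 2 case
{
    __m256d acc = _mm256_setzero_pd();
    int k;
    for (k = 0; k < l - 1; k += 2) {
        // load {b[k], b[k+1]} and shuffle to {b[k], b[k], b[k+1], b[k+1]}
        __m256d vb = _mm256_permute4x64_pd(
                _mm256_castpd128_pd256(_mm_loadu_pd(&b[k])),
                _MM_SHUFFLE(1, 1, 0, 0));
        __m256d va = _mm256_loadu_pd(&a[2*k]);     // k and k+1 in the a[m*k+i] layout
        acc = _mm256_fmadd_pd(va, vb, acc);        // two k iterations per FMA
    }
    // the k contributions are in the low 128-bit lane, k+1 in the high lane
    __m128d sum = _mm_add_pd(_mm256_castpd256_pd128(acc),
                             _mm256_extractf128_pd(acc, 1));
    if (k < l)   // odd l: one 128-bit FMA left
        sum = _mm_fmadd_pd(_mm_loadu_pd(&a[2*k]), _mm_set1_pd(b[k]), sum);
    _mm_storeu_pd(c, _mm_add_pd(_mm_loadu_pd(c), sum));
}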