Intel Intrinsics code optimization

Question

So i'm trying to multiply a constant with short int a[101] with intel intrinsics. I have done it with addition but i can't seem to figure why it wont work with multiplication. Also before we used ints of 32 bits and now we use 16 bit short so we can have double as many values in the intrinsics to fill the 128 bit as far as i understand?

naive example of what im trying to do:

int main(int argc, char **argv){
    short int a[101];
    int len = sizeof(a)/sizeof(short);

    /*Populating array a with values 1 to 101*/

    mult(len, a);

    return 0;
}

int mult(int len, short int *a){
    int result = 0;
    for(int i=0; i<len; i++){
        result += a[i]*20;  
    }
    return result;
}

And my code trying to do the same in intrinsics

/*Same main as before with a short int a[101] containing values 1 to 101*/

int SIMD(int len, short int *a){
    int res;
    int val[4];

    /*Setting constant value to mulitply with*/
    __m128i sum = _mm_set1_epi16(20);
    __m128i s = _mm_setzero_si128( );

    for(int i=0; i<len/4*4; i += 4){
        __m128i vec = _mm_loadu_si128((__m128i *)(a+i));
        s += _mm_mul_epu32(vec,sum);
    }

    _mm_storeu_si128((__m128i*) val, s);
    res += val[0] + val[1] + val[2] + val[3];

    /*Haldeling tail*/
    for(int i=len/4*4; i<len; i++){
        res += a[i];
    }
    return res;
}

So i do get a number out as result, but the number does not match the naive method, i have tried other intrinsics and changing numbers to see if it makes any noticable difference but nothing comes close to the output i expect. The computation time is almost the same as the naive at the moment aswell.

"s = _mm_mul_epu32(vec,sum);" looks odd. Did you mean s += ? — Bjorn A.
Yes was supposed to be +=. But the result is still odd (-1889431611) — Jesper Bang
res is never initialized. There are 8 16-bit integers in one 128-bit SSE block, not 4. So you should use 8 in various places where you have 4. val should be short val[8], and extend the expression val[0] + val[1] + val[2] + val[3] to … + val[7]. _mm_mul_epu32 multiplies 32-bit elements; it should be _mm_mullo_epi16 to multiply 16-bit elements. The statement s = _mm_mul_epu32(vec,sum); only multiplies vec by sum, but you want to multiply them and then add the product to s. — Eric Postpischil
@BjornA.: s += ... will use paddq (64-bit element size), because __m128i in GNU C is defined as typedef long long __m128i __attribute__((vector_size(16), may_alias)). But hey, the OP is already using 32-bit element multiply on 16-bit data, and this error doesn't even cause breakage when there isn't carry from one element to the next (on overflow). — Peter Cordes
To write this portably, use _mm_add_epi16 instead of GNU C / C++ +=. Or better, use _mm_madd_epi16 (pmaddwd) for the multiply to combine horizontal pairs of elements into 32-bit totals without overflow, then use _mm_add_epi32 — Peter Cordes

Eric Postpischil Eric Postpischil · Accepted Answer · 2018-02-23T12:17:19

There are 8 short in one __m128i. So:

for(int i=0; i<len/4*4; i += 4)

should be

for(int i=0; i<len/8*8; i += 8)`

and:

res += val[0] + val[1] + val[2] + val[3];

should be:

res += val[0] + val[1] + val[2] + val[3] + val[4] + val[5] + val[6] + val[7];

and:

for(int i=len/4*4; i<len; i++)

should be:

for(int i=len/8*8; i<len; i++)

In:

s += _mm_mul_epu32(vec,sum);

_mm_mul_epu32 operates on 32-bit elements. It should be:

s += _mm_mullo_epi16(vec, sum);

The object res is not initialized; it should be:

int res = 0;

Here is working code:

#include <stdio.h>
#include <stdlib.h>

#include <immintrin.h>

//  Number of elements in an array.
#define NumberOf(x) (sizeof (x) / sizeof *(x))


//  Compute the result with scalar arithmetic.
static int mult(int len, short int *a)
{
    int result = 0;
    for (size_t i=0; i<len; i++)
    {
        result += a[i]*20;  
    }
    return result;
}


//  Compute the result with SIMD arithmetic.
static int SIMD(int len, short int *a)
{
    //  Initialize the multiplier and the sum.
    __m128i multiplier = _mm_set1_epi16(20);
    __m128i s = _mm_setzero_si128( );

    //  Process blocks of 8 short.
    for (int i=0; i<len/8*8; i += 8)
    {
        __m128i vec = _mm_loadu_si128((__m128i *)(a+i));

        //  Multtiply by multiplier and add to sum.
        s = _mm_add_epi16(s, _mm_mullo_epi16(vec, multiplier));
    }

    //  Store the sum so far so its individual elements can be manipulated.
    short val[8];
    _mm_storeu_si128((__m128i*) val, s);

    //  Add the individual elements.
    int res = 0;
    for (size_t i = 0; i < 8; ++i)
        res += val[i];

    //  Add the elements in the tail.
    for (size_t i = len/8*8; i < len; ++i)
    {
        res += a[i];
    }

    return res;
}



int main(int argc, char **argv)
{
    short int a[96];
    int len = NumberOf(a);

    //  Initiailize a.
    for (size_t i = 0; i < len; ++i)
        a[i] = i+1;

    printf("sum by scalar arithmetic is %d.\n", mult(len, a));
    printf("sum by SIMD arithmetic is %d.\n", SIMD(len, a));

    return 0;
}

Intel Intrinsics code optimization

1 Answers