I'm trying to multiply each element of short int a[101] by a constant using Intel intrinsics. I have done it with addition, but I can't figure out why it won't work with multiplication. Also, we previously used 32-bit ints and now use 16-bit shorts, so as far as I understand we can fit twice as many values into a 128-bit register?
Naive example of what I'm trying to do:
int main(int argc, char **argv){
short int a[101];
int len = sizeof(a)/sizeof(short);
/*Populating array a with values 1 to 101*/
mult(len, a);
return 0;
}
int mult(int len, short int *a){
int result = 0;
for(int i=0; i<len; i++){
result += a[i]*20;
}
return result;
}
And my code trying to do the same with intrinsics:
/*Same main as before with a short int a[101] containing values 1 to 101*/
int SIMD(int len, short int *a){
int res;
int val[4];
/*Setting constant value to multiply with*/
__m128i sum = _mm_set1_epi16(20);
__m128i s = _mm_setzero_si128( );
for(int i=0; i<len/4*4; i += 4){
__m128i vec = _mm_loadu_si128((__m128i *)(a+i));
s += _mm_mul_epu32(vec,sum);
}
_mm_storeu_si128((__m128i*) val, s);
res += val[0] + val[1] + val[2] + val[3];
/*Handling tail*/
for(int i=len/4*4; i<len; i++){
res += a[i];
}
return res;
}
I do get a number out as the result, but it does not match the naive method. I have tried other intrinsics and changed the numbers to see if it makes any noticeable difference, but nothing comes close to the output I expect. The computation time is also almost the same as the naive version at the moment.
res is never initialized. There are 8 16-bit integers in one 128-bit SSE block, not 4. So you should use 8 in various places where you have 4. val should be short val[8], and extend the expression val[0] + val[1] + val[2] + val[3] to … + val[7]. _mm_mul_epu32 multiplies 32-bit elements; it should be _mm_mullo_epi16 to multiply 16-bit elements. The statement s = _mm_mul_epu32(vec,sum); only multiplies vec by sum, but you want to multiply them and then add the product to s. - Eric Postpischil

s += ... will use paddq (64-bit element size), because __m128i in GNU C is defined as typedef long long __m128i __attribute__((vector_size(16), may_alias)). But hey, the OP is already using 32-bit element multiply on 16-bit data, and this error doesn't even cause breakage when there isn't carry from one element to the next (on overflow). - Peter Cordes

Use _mm_add_epi16 instead of GNU C / C++ +=. Or better, use _mm_madd_epi16 (pmaddwd) for the multiply to combine horizontal pairs of elements into 32-bit totals without overflow, then use _mm_add_epi32. - Peter Cordes
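Putting the comments together, a corrected version might look roughly like the sketch below. It is not the original poster's code: the names mult_scalar, mult_simd, acc, factor and lanes are made up for illustration, and it assumes an x86 target with SSE2. It steps 8 shorts at a time, uses _mm_madd_epi16 followed by _mm_add_epi32 as suggested above, initializes the result, and also multiplies the tail elements by 20, which the question's tail loop does not do.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Scalar reference, equivalent to the question's mult(). */
int mult_scalar(int len, const short *a) {
    int result = 0;
    for (int i = 0; i < len; i++)
        result += a[i] * 20;
    return result;
}

/* SIMD sketch: 8 shorts per 128-bit load, 32-bit accumulation. */
int mult_simd(int len, const short *a) {
    const __m128i factor = _mm_set1_epi16(20); /* 20 in all 8 lanes */
    __m128i acc = _mm_setzero_si128();         /* four 32-bit partial sums */
    int i = 0;

    for (; i + 8 <= len; i += 8) {
        __m128i vec = _mm_loadu_si128((const __m128i *)(a + i));
        /* pmaddwd: multiply 16-bit lanes, then add adjacent pairs
           of products into 32-bit lanes */
        __m128i prod = _mm_madd_epi16(vec, factor);
        acc = _mm_add_epi32(acc, prod);
    }

    /* Horizontal sum of the four 32-bit lanes. */
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    int res = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Tail: the remaining len % 8 elements, still multiplied by 20. */
    for (; i < len; i++)
        res += a[i] * 20;

    return res;
}
```

For the question's array holding 1 to 101, both functions should return 103020 (5151 * 20). The 32-bit accumulators can still overflow for much longer arrays or larger values, just as the scalar int sum can.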