When using the vmlaq_s16 intrinsic (the VMLA.I16 instruction), the result is a vector of eight 16-bit integers. The products computed inside the instruction, however, each need 32 bits to be safe from overflow.
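To make the overflow concern concrete, here is a minimal scalar sketch in C of what a single lane does in the non-widening and in the widening case. The function names mla_lane_s16 and mlal_lane_s16 are made up for illustration (they are not NEON intrinsics), and the sketch assumes the non-widening form simply keeps the low 16 bits of the result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of ONE lane of a non-widening 16-bit
   multiply-accumulate: only the low 16 bits of acc + a*b survive. */
int16_t mla_lane_s16(int16_t acc, int16_t a, int16_t b)
{
    /* Unsigned arithmetic so that wrap-around is well defined in C. */
    uint32_t wide = (uint32_t)acc + (uint32_t)((int32_t)a * (int32_t)b);
    uint16_t low = (uint16_t)wide;
    /* Reinterpret the low 16 bits as a signed two's-complement value. */
    return (low & 0x8000u) ? (int16_t)((int32_t)low - 65536) : (int16_t)low;
}

/* Hypothetical scalar model of ONE lane of a widening multiply-accumulate:
   16-bit inputs, 32-bit accumulator. */
int32_t mlal_lane_s16(int32_t acc, int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* always fits in 32 bits */
    /* Assumption of this sketch: the accumulator itself does not overflow. */
    assert(p >= 0 ? acc <= INT32_MAX - p : acc >= INT32_MIN - p);
    return acc + p;
}
```

For example, 300 * 300 = 90000 does not fit in 16 bits, so the first function returns 24464 (90000 - 65536), while the second returns the full 90000.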
On Intel processors with SSE2, _mm_madd_epi16 keeps the result in a single 128-bit register (eight 16-bit inputs become four 32-bit results) by multiplying corresponding elements and then adding adjacent pairs of products, i.e.
r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)
where r0, r1, r2 and r3 are 32-bit results, and a0-a7 and b0-b7 are 16-bit elements.
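The pairing above can be written out as a small scalar reference in C. This is only a sketch of the formulas as listed, not Intel's implementation; the name madd_s16_model is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the pairwise multiply-add described above:
   r[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], with 16-bit inputs and 32-bit results. */
void madd_s16_model(const int16_t a[8], const int16_t b[8], int32_t r[4])
{
    assert(a && b && r);
    for (int i = 0; i < 4; ++i) {
        int64_t s = (int64_t)a[2 * i] * b[2 * i]
                  + (int64_t)a[2 * i + 1] * b[2 * i + 1];
        /* The only sum that exceeds 32 bits is the one where all four
           inputs are -32768; model it as a 32-bit wrap-around. */
        if (s > INT32_MAX) {
            s -= 4294967296LL;
        }
        r[i] = (int32_t)s;
    }
}
```

With a = b = {1, 2, 3, 4, 5, 6, 7, 8} this gives r = {5, 25, 61, 113}.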
Is there a trick I'm missing with vmlaq_s16 that would let me process eight 16-bit elements at once and still get results that don't overflow? Or is this instruction simply intended for operands that are inherently in the 8-bit range, so that each product fits in 16 bits (highly doubtful)?
Thanks!
EDIT: It just occurred to me that if vmlaq_s16 set overflow flag(s) for each element of the result, it would be easy to count the overflows and recover the full result.
EDIT 2: For everyone's reference, here's how to load 8 elements and issue two long (widening) multiply-adds on a 128-bit register with intrinsics (proof-of-concept code that compiles with VS2012 for the ARM target):
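#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    signed short vector1[] = {1, 2, 3, 4, 5, 6, 7, 8};
    signed short vector2[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int16x8_t v1 = vld1q_s16(vector1);   // load eight 16-bit lanes
    int16x8_t v2 = vld1q_s16(vector2);
    int32x4_t sum = vdupq_n_s32(0);      // four 32-bit accumulator lanes
    // Widening multiply-accumulate on each 64-bit half of the inputs.
    sum = vmlal_s16(sum, vget_low_s16(v1), vget_low_s16(v2));
    sum = vmlal_s16(sum, vget_high_s16(v1), vget_high_s16(v2));
    printf("sum: %d\n", vgetq_lane_s32(sum, 0));   // lane 0 = 1*1 + 5*5
    return 0;
}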
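One thing worth noting about the two-step approach above: accumulating the low and high halves pairs element i with element i+4, which is a different lane layout from the adjacent-pair sums of _mm_madd_epi16. The following scalar sketch in C models that layout so it can be checked without ARM hardware. The names mlal_halves_model and hsum4 are hypothetical, and the model assumes the accumulator starts at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of two widening multiply-accumulates, one on the
   low four lanes and one on the high four lanes, starting from a zero
   accumulator: r[i] = a[i]*b[i] + a[i+4]*b[i+4]. */
void mlal_halves_model(const int16_t a[8], const int16_t b[8], int32_t r[4])
{
    assert(a && b && r);
    for (int i = 0; i < 4; ++i) {
        int64_t s = (int64_t)a[i] * b[i] + (int64_t)a[i + 4] * b[i + 4];
        /* Only reachable when all four inputs are -32768; model a 32-bit wrap. */
        if (s > INT32_MAX) {
            s -= 4294967296LL;
        }
        r[i] = (int32_t)s;
    }
}

/* Horizontal sum of the four 32-bit lanes, kept in 64 bits to stay exact. */
int64_t hsum4(const int32_t r[4])
{
    return (int64_t)r[0] + r[1] + r[2] + r[3];
}
```

For the vectors {1, ..., 8} used above, the lanes come out as {26, 40, 58, 80} rather than {5, 25, 61, 113}, but the horizontal sum is 204 in both layouts, so the difference only matters if the individual lanes are needed and not just the overall dot product.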