Effective use of vmlaq_s16

Question

When using the vmlaq_s16 intrinsic/VMLA.I16 instruction, the result takes the form of a set of 8 16-bit integers. The multiplies inside the instructions however require the results to be stored in 32-bit integers to protect from overflow.

On Intel processors with SSE2, _mm_madd_epi16 preserves the length of the instruction (8 16-bit integers into 4 32-bit results) by multiplying and adding pairs of consecutive elements of the vectors, i.e.

r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)

Where r0,r1,r2,r3 are all 32-bit, and a0-a7, b0-b7 are all 16-bit elements.

Is there a trick that I'm missing with the vmlaq_s16 instruction that would allow me to still be able to process 8 16-bit elements at once and have results that don't overflow? Or is it the fact that this instruction is just provided for operands that are inherently in the 4-bit range (highly doubtful)?

Thanks!

EDIT: So I just thought about the fact that if vmlaq_s16 sets the overflow register flag(s?) for each of the elements in the result, then it's easy to count the overflows and recover the result.

EDIT 2: For everyone's reference, here's how to load 8 elements and pipeline two long multiply-adds on a 128bit register with intrinsics (proof of concept code that compiles with VS2012 for the ARM target):

signed short vector1[] = {1, 2, 3, 4, 5, 6, 7, 8};
signed short vector2[] = {1, 2, 3, 4, 5, 6, 7, 8};

int16x8_t v1; // = vdupq_n_s16(0);
int16x8_t v2; // = vdupq_n_s16(0);

v1 = vld1q_s16(vector1);
v2 = vld1q_s16(vector2);

int32x4_t sum = vdupq_n_s16(0);
sum = vmlal_s16(sum, v1.s.low64, v2.s.low64);
sum = vmlal_s16(sum, v1.s.high64, v2.s.high64);

printf("sum: %d\n", sum.n128_i32[0]);

Sadly, NEON has no concept of a flags register - the closest you get is the untrapped floating-point exception status bits, which are no help at all here :( — Notlikethat

Notlikethat Notlikethat · Accepted Answer · 2014-07-15T23:45:15

These aren't directly equivalent operations - VMLA multiples two vectors then adds the result elementwise to a 3rd vector, unlike the self-contained half-elementwise-half-horizontal craziness of Intel's PMADDWD. Since that 3rd vector is a regular operand it has to exist in a register, thus there's no room for a 256-bit accumulator.

If you don't want to risk overflow by using VMLA to do 8x16 * 8x16 + 8x16, the alternative is to use VMLAL to do 4x16 * 4x16 + 4x32. The obvious suggestion would be to pipeline pairs of instructions to process 8x16 vectors into two 4x32 accumulators then add them together at the end, but I'll admit I'm not too familiar with intrinsics so I don't know how difficult they would make that (compared to assembly where you can exploit the fact that "64-bit vectors" and "128-bit vectors" are simply interchangable views of the same register file).

Effective use of vmlaq_s16

1 Answers