2
votes

I want to do the following: I have 8 values (8 x 1 byte) in a NEON D register (64 bits). I need to shift every value left by 3, but I don't want to lose any bits. Afterwards I need to add the same 32-bit value to every element of the vector.

As I understand it, I can use the VQSHL instruction to put the result in 2 D registers if it overflows? How do I know whether an overflow occurred, and how can I guarantee that all of my data ends up in the new registers?

Also, could you help me with some code for the shift-and-add part?

Example Code:

out0 = CONSTANT_32BIT + ( input0 << 3)

out1 = CONSTANT_32BIT + ( input1 << 3)

out_n = CONSTANT_32BIT + ( input_n << 3)

So in theory I could do 8 or 16 of these operations in parallel using NEON registers?

The target is an ARM Cortex-A9, if that matters.


2 Answers

3
votes

You could do something like this (untested code, but should give you some idea of how to do it):

//Assumes signed ints
//d0: 8 input bytes
//q3: contains four copies of the 32-bit constant
//Perform shift and extend to 16-bit elements
vshll.s8 q0, d0, #3
//Extend 16-bit elements to 32-bit elements and add the 32-bit constants
vaddw.s16 q1, q3, d0
vaddw.s16 q2, q3, d1
//q1 now contains first four values, q2 the last four
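
For anyone who prefers C over hand-written assembly, the same sequence can be sketched with NEON intrinsics. This is only a sketch: `shift_add8` is a made-up helper name, the inputs are assumed to be signed bytes as in the assembly above, and the scalar fallback exists just so the snippet also builds on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = constant + (in[i] << 3) for 8 signed bytes. */
void shift_add8(const int8_t in[8], int32_t constant, int32_t out[8])
{
    assert(in != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int8x8_t  bytes   = vld1_s8(in);              /* d0: 8 input bytes          */
    int16x8_t shifted = vshll_n_s8(bytes, 3);     /* vshll.s8: shift and widen  */
    int32x4_t k       = vdupq_n_s32(constant);    /* four copies of the constant */
    /* vaddw.s16: widen each 16-bit half to 32 bits and add the constant */
    int32x4_t lo = vaddw_s16(k, vget_low_s16(shifted));
    int32x4_t hi = vaddw_s16(k, vget_high_s16(shifted));
    vst1q_s32(out, lo);       /* first four results */
    vst1q_s32(out + 4, hi);   /* last four results  */
#else
    /* Scalar fallback with the same arithmetic. Multiplying by 8 avoids
       left-shifting a negative value, and the unsigned add wraps on
       overflow the way the vector add does. */
    for (int i = 0; i < 8; ++i) {
        int16_t shifted = (int16_t)(in[i] * 8);
        out[i] = (int32_t)((uint32_t)constant + (uint32_t)(int32_t)shifted);
    }
#endif
}
```

Since an 8-bit value shifted left by 3 needs at most 11 bits, the 16-bit intermediate never overflows, so no bits are lost before the 32-bit add.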
2
votes

VQSHL is a saturating shift; it does not widen the result into a second register. Any lane whose shifted value would overflow is clamped to the maximum (or, for signed types, minimum) representable value, so the overflowing bits are lost. If that is the behavior you want, it will work for you. If saturation occurred, the processor sets FPSCR.QC (the cumulative saturation flag).
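
To make the saturation behavior concrete, here is a rough scalar model in C of what happens to a single signed 8-bit lane. It is only an illustration, not how the hardware is implemented: `qshl_s8_model` and its `saturated` out-parameter are made-up names, with the latter standing in for the sticky QC flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of one signed 8-bit lane of a saturating
   left shift. *saturated is set to 1 when the result is clamped and is
   never cleared here, mimicking a cumulative (sticky) flag. */
int8_t qshl_s8_model(int8_t value, unsigned shift, int *saturated)
{
    assert(shift < 8 && saturated != NULL);
    /* Do the shift in a wider type so the true result is visible. */
    int32_t wide = (int32_t)value * (int32_t)(1u << shift);
    if (wide > INT8_MAX) {
        *saturated = 1;
        return INT8_MAX;   /* clamped to +127 */
    }
    if (wide < INT8_MIN) {
        *saturated = 1;
        return INT8_MIN;   /* clamped to -128 */
    }
    return (int8_t)wide;   /* fits, no information lost */
}
```

With a shift of 3, any input outside the range -16..15 gets clamped, which is why a saturating shift alone cannot preserve all the bits.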

From your description it sounds like you don't want any overflow or saturation at all. If you plan to add a 32-bit value to each 8-bit value, the result will generally not fit in an 8-bit lane. Perhaps you should consider loading your 8-bit values into wider lanes, e.g. as four 32-bit lanes per Q register. You can use the multiple-element forms of VLD to help spread the 8-bit values across NEON registers: something like VLD4.8 {d0[0],d1[0],d2[0],d3[0]}, [r0] loads four consecutive bytes into lane 0 of d0 through d3, and a second load with a different lane index can pick up the remaining four. Another option is to use VZIP.